+++++++++++++++++++++++++++++++++++++++++++++++++

CS8761: Assignment 4 - Report
-----------------------------
Author: Amine Abou-Rjeili
Date: 11/10/2002

Files included in tar file:
---------------------------

select.pl - Splits the data into TRAIN and TEST sets according to the
        given ratio.

feat.pl - Each line from the input file is split into 3 parts: key,
        sense, and sentence. Features are extracted from the sentence
        part only.

convert.pl - Converts a data file into feature representation.

nb.pl - Runs a Naive Bayesian classifier on the data. Changes made as
        compared to the previous version:
        1) The probability of the sense that occurred the most times
           is displayed in the case that there are no features and
           classification falls back on the "most common sense"
           algorithm.
        2) The probability is displayed using 15 decimal places
           instead of the mathematical 'e' format.

run-experiment - Performs the experiments given a set of data. The
        data is split at a 70-30 TRAIN/TEST ratio and the experiments
        are carried out according to the constant and variable
        environments (see description below). The following files
        will be output:

        experiment.txt - report of results
        FEAT           - list of features
        TAGGING.0-1    - tagging detail for window 0, frequency cutoff 1
        TAGGING.0-2    - tagging detail for window 0, frequency cutoff 2
        TAGGING.0-5    - tagging detail for window 0, frequency cutoff 5
        TAGGING.10-1   - tagging detail for window 10, frequency cutoff 1
        TAGGING.10-2   - tagging detail for window 10, frequency cutoff 2
        TAGGING.10-5   - tagging detail for window 10, frequency cutoff 5
        TAGGING.2-1    - tagging detail for window 2, frequency cutoff 1
        TAGGING.2-2    - tagging detail for window 2, frequency cutoff 2
        TAGGING.2-5    - tagging detail for window 2, frequency cutoff 5
        TAGGING.25-1   - tagging detail for window 25, frequency cutoff 1
        TAGGING.25-2   - tagging detail for window 25, frequency cutoff 2
        TAGGING.25-5   - tagging detail for window 25, frequency cutoff 5
        target.config  - config file
        TEST           - test examples
        TEST.FV        - test examples in feature format
        TRAIN          - train examples
        TRAIN.FV       - train examples in feature format

Files added to get statistics and convert the Open Mind data into data
appropriate for the classifier:

count_words.pl - Counts the total number of words that have been
        tagged in the files given as command line arguments. This
        script is used to count how many words were tagged in the
        entire cs8761-umd.full.detailed file.

find_good_words.pl - Calculates statistics from the data in the file
        given on the command line (the cs8761-umd.full.detailed
        file). The statistics calculated are the criteria, specified
        in the assignment, that make a "good" word. In addition, the
        script will filter out any duplicate and bad examples and
        split the input file into multiple files, each corresponding
        to a particular word. Each 'word' file will contain all
        examples for that particular word, and each example will
        occur only once. For example, if we assume that the
        cs8761-umd.full.detailed file contains examples for the words
        'edge' 'energy' 'hope', then this script will create 3 files
        in the current directory called 'edge' 'energy' 'hope',
        corresponding to each word from the main file.
Each file will contain the examples from the detailed file except for
the ones that have been filtered out. Examples are filtered out
according to the following rules (a sketch of this logic appears
after the output format description below):

1) If an example has been tagged more than once and none of the
   taggings agree at least 2 times, the example is dropped.
2) If an example has been tagged more than 2 times and more than 1
   sense has an agreement rate of 2 or more, then the sense with the
   highest agreement rate will be taken. The others will be dropped.
3) In case of a tie in agreement rate between more than 2 senses, the
   last sense will be picked and the rest dropped.

These rules produce files in which each example ID occurs at most one
time.

convert_data.pl - Converts the files output by the above script into
        the format that the classifier understands. This is the same
        format as the line data. However, the input format must be
        the same as that output by the script find_good_words.pl.

For more information about the included scripts, see the documentation
provided at the beginning of each script.

EXAMPLE RUN USING PROVIDED SCRIPTS:
-----------------------------------

shell> find_good_words.pl /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed > GOOD_WORDS2

Select words from the list of possible good words and run as follows,
e.g. for 'aspect':

shell> convert_data.pl aspect aspect /home/cs/tpederse/CS8761/Open-Mind/ids-to-sentences

Now we will have a list of files corresponding to the senses of the
word 'aspect' in the appropriate format.

Optional:
---------

shell> run-experiment all_senses_file

This will run all the experiments on the provided senses file.

EXPERIMENTS AND ANALYSIS:
-------------------------

The task for this assignment is to take the Naive Bayesian classifier
implemented in assignment 3 and run it using 3 "good" words from the
data obtained for the class from the Open Mind project. A good word
is defined as follows (as taken from the Assignment 4
specifications):

* Has a reasonable rate of agreement among the two taggers.
* Has a somewhat balanced distribution of senses. At the very least
  avoids the case where a single sense dominates the distribution.
* Has a reasonable number of examples.

To help in identifying "good" words, I created the script
find_good_words.pl to print out some statistics and decompose the
examples from the main Open Mind data file into separate files for
each word (see the description above for more information). An output
extract from a run of the script on the cs8761-umd.full.detailed file
is as follows:

'Word'  'Example Count'  'Number of senses'  'List of senses'

arc      64   3 - 1:19:00:: 1:25:00:: 1:25:01::
        Sense Frequency
        1:19:00:: = 7
        1:25:00:: = 9
        1:25:01:: = 48
argument 94   3 - 1:10:00:: 1:10:02:: 1:10:03::
        Sense Frequency
        1:10:00:: = 39
        1:10:02:: = 32
        1:10:03:: = 23
art      110  4 - 1:06:00:: 1:10:00:: 1:04:00:: 1:09:00::
        Sense Frequency
        1:06:00:: = 31
        1:10:00:: = 19
        1:04:00:: = 35
        1:09:00:: = 25
aspect   100  5 - 1:24:00:: 1:07:02:: 1:09:01:: 1:09:00:: 1:07:01::
        Sense Frequency
        1:24:00:: = 3
        1:07:02:: = 23
        1:09:01:: = 15
        1:09:00:: = 55
        1:07:01:: = 4

The output has the following format:

* The first line of each word is as follows:

  word  total_number_of_occurrences  total_number_of_senses - list_of_senses

  The tab character (\t) is used as the field delimiter.

* The following lines list the senses with a count of how many times
  each sense was encountered in the examples file. Each sense is
  displayed on its own line and is indented with a tab character (\t)
  at the beginning of the line.
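As a side note, the filtering logic described by rules 1-3 above can
be sketched in Perl roughly as follows. This is a minimal
illustration of the rules, not the actual internals of
find_good_words.pl; the subroutine name and data layout are
assumptions made for the example.

  #!/usr/bin/perl -w
  use strict;

  # Given all sense tags recorded for one example ID, return the
  # single sense to keep, or undef if the example must be dropped.
  sub pick_sense {
      my @tags = @_;                 # e.g. ('1:25:00::', '1:25:00::', '1:19:00::')
      return $tags[0] if @tags == 1; # tagged once: nothing contradicts it

      my %count;
      $count{$_}++ for @tags;

      # senses with an agreement rate of 2 or more
      my @agreed = grep { $count{$_} >= 2 } keys %count;
      return undef unless @agreed;   # rule 1: no agreement, drop example

      # rules 2 and 3: keep the sense with the highest agreement rate;
      # on a tie the last sense examined wins and the rest are dropped
      my $best;
      for my $sense (@agreed) {
          $best = $sense if !defined $best || $count{$sense} >= $count{$best};
      }
      return $best;
  }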
These statistics show the sense distribution for a word together with
the total number of available examples according to the filtering
rules specified above. However, these filtering rules introduce the
following pitfall:

* In the case that each example has been tagged ONLY ONCE, all the
  examples will be included, since there is nothing to contradict a
  particular example, so it is assumed to be correct. This is the
  case for the word 'art' in the cs8761-umd.full.detailed data.

Therefore, in addition to looking at these statistics, it is
recommended to also check the actual data file. These statistics are
meant to help in the process of identifying good words, not to
actually do the identification.

As mentioned above, this script will also produce a file for each
word encountered in the main data file. After the find_good_words.pl
script had been run, the output was examined to identify six good
words. These are as follows (with their statistics):

* argument 94  3 - 1:10:00:: 1:10:02:: 1:10:03::
        Sense Frequency
        1:10:00:: = 39
        1:10:02:: = 32
        1:10:03:: = 23

* aspect 100  5 - 1:24:00:: 1:07:02:: 1:09:01:: 1:09:00:: 1:07:01::
        Sense Frequency
        1:24:00:: = 3
        1:07:02:: = 23
        1:09:01:: = 15
        1:09:00:: = 55
        1:07:01:: = 4

* edge 106  6 - 1:15:00:: 1:07:00:: 1:06:01:: 1:06:00:: 1:25:00:: 1:07:01::
        Sense Frequency
        1:15:00:: = 19
        1:07:00:: = 38
        1:06:01:: = 6
        1:06:00:: = 15
        1:25:00:: = 15
        1:07:01:: = 13

* energy 148  6 - 1:14:00:: 1:07:00:: 1:19:00:: 1:26:00:: 1:07:02:: 1:07:01::
        Sense Frequency
        1:14:00:: = 6
        1:07:00:: = 51
        1:19:00:: = 67
        1:26:00:: = 12
        1:07:02:: = 6
        1:07:01:: = 6

* hope 111  5 - 1:07:00:: 1:18:00:: 1:12:00:: 1:09:00:: 1:12:01::
        Sense Frequency
        1:07:00:: = 1
        1:18:00:: = 9
        1:12:00:: = 29
        1:09:00:: = 44
        1:12:01:: = 28

* length 111  5 - 1:06:00:: 1:07:00:: 1:07:02:: 1:07:03:: 1:07:01::
        Sense Frequency
        1:06:00:: = 6
        1:07:00:: = 27
        1:07:02:: = 23
        1:07:03:: = 23
        1:07:01:: = 32

I chose these six words because they have a good number of senses
(more than 2) and an acceptably balanced distribution, so no sense
totally dominates the distribution.

For the experiments, I used the 70-30 ratio as in assignment 3. I
tried other ratios and the results were approximately the same, with
some ratios giving worse results, so I decided to adopt this ratio
scheme. The reasoning behind trying different ratios is that for all
of these words the number of examples available to train and test on
is relatively small compared to the line data. So I tried using a
higher ratio for training, such as 75-25, 80-20, and 90-10, but the
results did not show any improvement, and in the case of 90-10 they
were actually worse. The same window sizes and frequency cutoffs as
in assignment 3 were therefore used for the experiments here.
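For reference, the kind of shuffled 70-30 split that select.pl
performs can be sketched as follows. This is a minimal sketch, not
the actual select.pl code: it assumes one instance per input line and
hard-codes the ratio rather than reading it from the command line.

  #!/usr/bin/perl -w
  use strict;
  use List::Util qw(shuffle);

  # Read all instances, shuffle them, and write the first 70% to
  # TRAIN and the remaining 30% to TEST.
  my @instances = shuffle <STDIN>;
  my $cut = int(0.70 * @instances);

  open my $train, '>', 'TRAIN' or die "TRAIN: $!";
  open my $test,  '>', 'TEST'  or die "TEST: $!";
  print $train @instances[0 .. $cut - 1];
  print $test  @instances[$cut .. $#instances];
  close $train;
  close $test;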
The following results were obtained for each word:

argument-data/experiment.txt
-----------------------------

window size  frequency cutoff  accuracy
0            1                 0.3929  [11 of 28 correct]
0            2                 0.3929  [11 of 28 correct]
0            5                 0.3929  [11 of 28 correct]
2            1                 0.5357  [15 of 28 correct]
2            2                 0.4643  [13 of 28 correct]
2            5                 0.5357  [15 of 28 correct]
10           1                 0.4286  [12 of 28 correct]
10           2                 0.5000  [14 of 28 correct]
10           5                 0.3929  [11 of 28 correct]
25           1                 0.4643  [13 of 28 correct]
25           2                 0.5000  [14 of 28 correct]
25           5                 0.3571  [10 of 28 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.2500  [ 7 of 28 correct]
0            2                 0.3571  [10 of 28 correct]
0            5                 0.5000  [14 of 28 correct]
2            1                 0.5714  [16 of 28 correct]
2            2                 0.6429  [18 of 28 correct]
2            5                 0.5357  [15 of 28 correct]
10           1                 0.3571  [10 of 28 correct]
10           2                 0.5714  [16 of 28 correct]
10           5                 0.4643  [13 of 28 correct]
25           1                 0.3571  [10 of 28 correct]
25           2                 0.6429  [18 of 28 correct]
25           5                 0.5714  [16 of 28 correct]

aspect-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.4667  [14 of 30 correct]
0            2                 0.4667  [14 of 30 correct]
0            5                 0.4667  [14 of 30 correct]
2            1                 0.4333  [13 of 30 correct]
2            2                 0.3333  [10 of 30 correct]
2            5                 0.4000  [12 of 30 correct]
10           1                 0.4333  [13 of 30 correct]
10           2                 0.3333  [10 of 30 correct]
10           5                 0.3000  [ 9 of 30 correct]
25           1                 0.4333  [13 of 30 correct]
25           2                 0.4000  [12 of 30 correct]
25           5                 0.4333  [13 of 30 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.6667  [20 of 30 correct]
0            2                 0.5333  [16 of 30 correct]
0            5                 0.5333  [16 of 30 correct]
2            1                 0.4000  [12 of 30 correct]
2            2                 0.5000  [15 of 30 correct]
2            5                 0.4000  [12 of 30 correct]
10           1                 0.5000  [15 of 30 correct]
10           2                 0.3667  [11 of 30 correct]
10           5                 0.3333  [10 of 30 correct]
25           1                 0.4333  [13 of 30 correct]
25           2                 0.5000  [15 of 30 correct]
25           5                 0.3333  [10 of 30 correct]

edge-data/experiment.txt
------------------------

window size  frequency cutoff  accuracy
0            1                 0.3871  [12 of 31 correct]
0            2                 0.3871  [12 of 31 correct]
0            5                 0.3871  [12 of 31 correct]
2            1                 0.2903  [ 9 of 31 correct]
2            2                 0.2903  [ 9 of 31 correct]
2            5                 0.3226  [10 of 31 correct]
10           1                 0.4516  [14 of 31 correct]
10           2                 0.3871  [12 of 31 correct]
10           5                 0.3226  [10 of 31 correct]
25           1                 0.3871  [12 of 31 correct]
25           2                 0.2581  [ 8 of 31 correct]
25           5                 0.2903  [ 9 of 31 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.3226  [10 of 31 correct]
0            2                 0.3548  [11 of 31 correct]
0            5                 0.3548  [11 of 31 correct]
2            1                 0.3548  [11 of 31 correct]
2            2                 0.2903  [ 9 of 31 correct]
2            5                 0.2903  [ 9 of 31 correct]
10           1                 0.2903  [ 9 of 31 correct]
10           2                 0.2581  [ 8 of 31 correct]
10           5                 0.3548  [11 of 31 correct]
25           1                 0.3871  [12 of 31 correct]
25           2                 0.3548  [11 of 31 correct]
25           5                 0.1613  [ 5 of 31 correct]

energy-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.5455  [24 of 44 correct]
0            2                 0.5455  [24 of 44 correct]
0            5                 0.5455  [24 of 44 correct]
2            1                 0.5000  [22 of 44 correct]
2            2                 0.4773  [21 of 44 correct]
2            5                 0.4091  [18 of 44 correct]
10           1                 0.5909  [26 of 44 correct]
10           2                 0.5909  [26 of 44 correct]
10           5                 0.6364  [28 of 44 correct]
25           1                 0.5227  [23 of 44 correct]
25           2                 0.5909  [26 of 44 correct]
25           5                 0.5455  [24 of 44 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.4545  [20 of 44 correct]
0            2                 0.4773  [21 of 44 correct]
0            5                 0.4773  [21 of 44 correct]
2            1                 0.3636  [16 of 44 correct]
2            2                 0.3864  [17 of 44 correct]
2            5                 0.2955  [13 of 44 correct]
10           1                 0.4091  [18 of 44 correct]
10           2                 0.4318  [19 of 44 correct]
10           5                 0.3864  [17 of 44 correct]
25           1                 0.4318  [19 of 44 correct]
25           2                 0.3864  [17 of 44 correct]
25           5                 0.5000  [22 of 44 correct]

hope-data/experiment.txt
------------------------

window size  frequency cutoff  accuracy
0            1                 0.4545  [15 of 33 correct]
0            2                 0.4545  [15 of 33 correct]
0            5                 0.4545  [15 of 33 correct]
2            1                 0.4545  [15 of 33 correct]
2            2                 0.4242  [14 of 33 correct]
2            5                 0.3939  [13 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.3939  [13 of 33 correct]
10           5                 0.3939  [13 of 33 correct]
25           1                 0.3636  [12 of 33 correct]
25           2                 0.3636  [12 of 33 correct]
25           5                 0.3333  [11 of 33 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.3636  [12 of 33 correct]
0            2                 0.3636  [12 of 33 correct]
0            5                 0.4242  [14 of 33 correct]
2            1                 0.2424  [ 8 of 33 correct]
2            2                 0.2424  [ 8 of 33 correct]
2            5                 0.3030  [10 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.2727  [ 9 of 33 correct]
10           5                 0.1515  [ 5 of 33 correct]
25           1                 0.4242  [14 of 33 correct]
25           2                 0.1818  [ 6 of 33 correct]
25           5                 0.2727  [ 9 of 33 correct]

length-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.2121  [ 7 of 33 correct]
0            2                 0.2121  [ 7 of 33 correct]
0            5                 0.2121  [ 7 of 33 correct]
2            1                 0.3636  [12 of 33 correct]
2            2                 0.3030  [10 of 33 correct]
2            5                 0.2727  [ 9 of 33 correct]
10           1                 0.3333  [11 of 33 correct]
10           2                 0.4242  [14 of 33 correct]
10           5                 0.3939  [13 of 33 correct]
25           1                 0.3030  [10 of 33 correct]
25           2                 0.4242  [14 of 33 correct]
25           5                 0.3333  [11 of 33 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.2121  [ 7 of 33 correct]
0            2                 0.2121  [ 7 of 33 correct]
0            5                 0.2121  [ 7 of 33 correct]
2            1                 0.3939  [13 of 33 correct]
2            2                 0.2727  [ 9 of 33 correct]
2            5                 0.3636  [12 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.4545  [15 of 33 correct]
10           5                 0.3030  [10 of 33 correct]
25           1                 0.2121  [ 7 of 33 correct]
25           2                 0.2424  [ 8 of 33 correct]
25           5                 0.3636  [12 of 33 correct]

Here I must note that the same 2 cases of experiments as in
assignment 3 were used. For clarity, I will briefly describe both
experiments:

1) Split the data into TRAIN and TEST partitions once and then run
   the experiments using these partitions. This experiment was
   carried out to compare the performance of the different
   combinations using the same TRAIN and TEST data (constant
   environment).
   The results of these experiments are summarized in TABLE 1.

2) For each combination of window size and frequency cutoff, a
   different pair of TRAIN and TEST partitions was used. The
   partitions differ in every run because they are randomly drawn
   from the entire line data. This experiment was carried out to see
   how different partitions can affect the performance of each
   experiment. As can be observed from the data below, the
   performances of the 2 experiments are very similar, so the
   changing environments did not have a great impact. However, in
   some cases performance was degraded slightly with different
   partitions (in experiment 2). I must note that these 2 sets of
   experiments do not necessarily imply that having a constant
   environment is always better. It seems to be a matter of finding
   a set of TRAINING data diverse enough to reflect any set of test
   data with the most accuracy.

The above is extracted from the experiments.txt file from
assignment 3.

It was also planned to further split the training examples into
subgroups, run the experiment on each group, and then take the
average accuracy over all the groups. However, after careful
consideration, I decided not to carry out this experiment: with the
small number of examples provided, it would not produce any
significant results in the accuracy.

As can be seen from the output of the experiments, the results look
very poor compared to the "line" data. One reason for the poor
performance is the lack of examples. Clearly, if the classifier is
provided with about one thousand examples to train on ("line" data),
it will perform better than when provided with about one hundred
examples (this data). Also, in some cases the performance actually
dropped when a window was introduced. One reason for this is that
most of the features were not seen in the training examples, due,
again, to the small number of examples provided. Another reason for
the poor performance with this data is that in some cases the
features do not provide a clear-cut hint as to what the sense is.
The "line" data was comparatively straightforward in that the
features identified the sense in a clear manner. I believe this data
is more complex than the "line" data.

Conclusion:
----------
This assignment gave me some insight into the pitfalls of a Naive
Bayesian classifier. One such insight is that a lot of data is
required to train the classifier if it is to perform at an acceptable
level. Also, the features must provide some hint as to the
classification and cannot be totally vague. On the other hand, the
classifier seemed to handle noise very well. I noticed this when I
ran the classifier with all of the data for some of the words, which
includes a lot of noise in the form of the same instance being
classified with more than one sense. With such data the accuracy
dropped slightly as compared with the filtered data, but it remained
comparable.
+++++++++++++++++++++++++++++++++++++++++++++++++

Name  : Nitin Agarwal
Date  : 11-Nov-02
Class : Natural Language Processing CS8761
===============================================================

Objective

To create sense tagged text and explore supervised approaches to word
sense disambiguation.

Procedure

Consider a huge corpus in which one word occurs in all the lines and
can have various meanings. We split this corpus into 2 sets. We
analyze one set using statistical natural language processing
techniques to get the training data. Using this data we try to figure
out the meaning of the target word in the test data. After this we
compare the meaning of the word obtained using this method with the
actual meaning and find the accuracy of the method. The value of the
accuracy can be anywhere between 0 and 1; the closer the value is to
1, the better the method.

The above test has been done using the word "line". The experiment
was performed in four phases.

Phase 1 (select.pl)

When this program is run, we get two output files, TRAIN and TEST,
whose lines contain the word "line" with different meanings
associated with them. Before writing into the two output files, all
the input lines are randomized so that the various senses of "line"
are mixed together.

Phase 2 (feat.pl)

The outcome of this program is a feature vector. This vector has
distinct features associated with it depending on the sense of
"line" in each of its occurrences. The vector also depends on the
window size and the frequency value specified on the command line.
The larger the window size, the more features are associated with
each occurrence of "line". We will later see that a higher window
size results in estimating the sense of each occurrence with more
accuracy. Conversely, a high frequency value results in a slightly
lower accuracy, as this value is the cutoff for frequency and only
the words that occur more than this many times across all the
windows are considered in the feature vector. This program is only
executed on the TRAIN file from select.pl to obtain the features.

Phase 3 (convert.pl)

The output of feat.pl, together with first TRAIN and then TEST, is
the input to this program, and the output is a binary feature
vector. This binary vector shows whether each instance has each of
the features. This program is run for several different combinations
of window and frequency values.

Phase 4 (nb.pl)

This program processes the binary vectors for a pair of TEST.FV and
TRAIN.FV files with the same values for window size and frequency
cutoff. A Naive Bayesian classifier is implemented to determine the
sense of the target word in the TEST file using the data obtained
from the TRAIN file. For the word types that did not occur in the
vector, smoothing is performed to give them a small probability
value; we assume that these words simply did not occur in this set
of data and may occur in another set. We get 12 sets of outputs
after running this program for all window and frequency
combinations. These values are tabulated below. The row values are
the window sizes and the column values are the cutoff values for
frequency.
window size / frequency cutoff      1        2        5
 0                               0.5571   0.5571   0.5571
 2                               0.7250   0.7113   0.6908
10                               0.8086   0.7973   0.7867
25                               0.8267   0.8132   0.8012

Observation

Looking at the above table, we see that all the experiments with a
window size of 0 give the same accuracy, 0.5571. This is because if
the window size is 0 we are not getting any words, so the frequency
value does not matter. There is no feature vector and we just have
some random senses assigned to each instance. Of these instances
some happen to be correct, and hence we have non-zero values for all
the cases with a window size of 0.

When the window size is 2, we get 4 words for each instance (2 on
the left and 2 on the right). If the frequency cutoff is taken as 1,
then any word that occurs more than once in all the instances is
taken into the feature vector. Using the program nb.pl, which is
explained above, we get an accuracy of 0.7250, which shows that we
estimated the sense of about 72% of the instances correctly. With
higher values of the frequency cutoff we notice that the accuracy
declines. The reason is that when the frequency cutoff is higher,
fewer words are in the feature vector, limiting our data from the
TRAIN file and resulting in a lower accuracy.

For the window size of 10 we get an even higher accuracy. This
suggests that there are some good features of a word even as far as
about 10 words away from it. Therefore, it is a good idea to
consider a window that is not too small. Again, accuracy drops with
an increase in the frequency cutoff, for the same reasons cited
above.

Finally, we run the experiment with a window size of 25. Again, as
expected, we have an increase in accuracy; nevertheless, this time
the difference is very small. In addition, with increasing frequency
cutoff the accuracy drops, similar to the cases considered earlier.

However, there is an important point worth mentioning before we
conclude. We notice from the above table that although accuracy
increases with window size, the increase is very small, and accuracy
improves only marginally when the window size goes from 10 to 25. If
we increase the window size further we may still get better
accuracy, but it would hardly be worth the processing time required
for that window size. Hence, in order to get even better accuracy
values, it is not enough to just increase the window size. Instead,
we should work on improving the classifier so that it can give us
better results.

Assignment 4
------------

Objective

To continue the exploration of supervised approaches to word sense
disambiguation using data from Open Mind.

Experiment

First we write programs to put the data obtained from Open Mind into
a format similar to the line data used in the previous assignment.
The following programs have been written to achieve this.

separator1.pl and separator2.pl

These two programs separate the data in the files
cs8761-umd.full.detailed and ids-to-sentences respectively. They
divide the contents of each file into several files depending on the
target word, so after executing the 2 programs we have a list of
output files. separator1.pl outputs its files in the format
<word>.tags; hence, for the word "line" it would output "line.tags".
Similarly, separator2.pl produces a list of files named <word>.data;
for "line" the file would be "line.data". (A sketch of this per-word
splitting appears below.)
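A minimal sketch of this per-word splitting is shown below. It is
illustrative only: the assumption that the target word is the first
whitespace-separated field of each line is mine, not the actual
record format of the Open Mind files.

  #!/usr/bin/perl -w
  use strict;

  # Split a combined file (read from STDIN or the command line) into
  # one <word>.tags file per target word.
  my %fh;
  while (my $line = <>) {
      # assumed: the target word is the first field on the line
      my ($word) = $line =~ /^(\S+)/ or next;
      unless ($fh{$word}) {
          open $fh{$word}, '>', "$word.tags" or die "$word.tags: $!";
      }
      print { $fh{$word} } $line;
  }
  close $_ for values %fh;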
The two programs are run as follows:

perl separator1.pl
perl separator2.pl

readtaginfo.pl

This program reads the information from the *.tags files created by
separator1.pl and analyzes it to determine good words. The program
returns a file with the tag information for all the words and also
marks a few good words. The good words are as defined in the
assignment statement on Dr. Ted Pedersen's web page for the class
CS8761. This program computes the following:

1) Senses: the number of unique senses associated with a word. The
   count excludes "unclear" and "unlisted-sense" if they were
   assigned to any word.
   Good sense: the program checks for all the senses of a word that
   occur more than 40% and less than 160% of the average sense
   frequency.

2) Distribution: a good word needs to be evenly distributed among
   many of its senses. The more evenly distributed the senses are,
   the better the word.
   Good distribution: if more than half of the senses of a word are
   good senses, then the word is considered to have a good
   distribution.

3) Instances: the total number of unique instances for a given word.
   Good instance: any word that has more than 100 instances is
   considered a good instance.

4) Agreement: the agreement for a given word is the agreement
   between the users who tagged that word. If 3 users tagged an
   instance of a word and they all tagged the same sense, then the
   agreement for that instance is 1. If just 2 of them agreed on a
   sense, the agreement is 0.67, and if all of them disagree the
   agreement is 0.33. The agreement of a word is given as the
   average of the agreements of all the individual instances for
   that word.
   Good agreement: I have considered a word to have good agreement
   if its agreement is more than 60% and less than 100%. The latter
   condition checks for words that were tagged by only one user, and
   the former condition is close to the value (65%) mentioned by
   Dr. Rada Mihalcea in her colloquium.

All the "good" thresholds mentioned above were narrowed down by the
author using trial and error to get the 6 best words for this
assignment. Others may identify 6 good words using an entirely
different set of conditions, depending on their choice. However,
this program yielded only 5 good words, namely "depth",
"difficulty", "edge", "length" and "shape". The sixth word was
selected by inspection from the output of this program, and this
last word is "behavior". It was selected on criteria similar to
those discussed above, although manually.

The program is run as follows:

perl readtaginfo.pl arc.tags depth.tags shape.tags ...

The resulting file is "taginfo".

sensor.pl

This program is executed for the 6 good words obtained from the
above program. It reads a pair of .tags and .data files and produces
a set of files named after the senses assigned to that word. The
number of output files equals the number of senses for the
particular word. The program takes care not to include multiple
occurrences of the same instance. Furthermore, care has been taken
to assign an instance to the sense that most users thought was
correct. For instance, if 3 users tagged an instance of a word and
all of them agreed, then the instance is assigned to that sense; if
just 2 of them agreed, the instance is assigned to the most
agreed-upon sense. In the case where all of them disagree, the
instance is assigned to a sense at random.
If an instance was tagged as "unclear" or "unlisted-sense", then
again it is assigned to a sense at random.

The program is run as follows:

perl sensor.pl <word>

If the word is "line" the command would look like this:

perl sensor.pl line

While running this program, care should be taken to change the
regular expression in the file. The new regular expression can be
obtained from the first line of the .config file.

At this point we run the output obtained from sensor.pl for the
various words through the program files developed in Assignment 3 to
obtain the accuracy for each word using the Naive Bayesian
classifier. The following tables show the accuracy values for the 6
good words for all 12 combinations of window size and frequency
cutoff, as was done for the line data in assignment 3.

behavior
window size / frequency cutoff      1        2        5
 0                                0.55     0.55     0.55
 2                                0.55     0.55     0.55
10                                0.5      0.5      0.55
25                                0.5      0.5      0.525

depth
window size / frequency cutoff      1        2        5
 0                                0.3333   0.3333   0.3333
 2                                0.2667   0.2444   0.3333
10                                0.2667   0.2444   0.2444
25                                0.2667   0.2444   0.2444

difficulty
window size / frequency cutoff      1        2        5
 0                                0.375    0.375    0.375
 2                                0.35     0.35     0.375
10                                0.325    0.325    0.325
25                                0.325    0.325    0.325

edge
window size / frequency cutoff      1        2        5
 0                                0.225    0.225    0.225
 2                                0.25     0.25     0.25
10                                0.275    0.275    0.275
25                                0.3      0.3      0.275

length
window size / frequency cutoff      1        2        5
 0                                0.1364   0.1364   0.1364
 2                                0.2045   0.1818   0.1591
10                                0.1591   0.2045   0.1818
25                                0.1591   0.2045   0.2045

shape
window size / frequency cutoff      1        2        5
 0                                0.1282   0.1282   0.1282
 2                                0.1026   0.1282   0.1282
10                                0.1282   0.1282   0.1282
25                                0.1282   0.1282   0.1282

The values in the above tables do not tally with what was expected.
The values in many tables decrease as the window size increases, and
at times they increase with an increase in the frequency cutoff.
This totally contradicts the statements made above about the
accuracy values. Moreover, the values are too low. It is difficult
to say how other words would behave if this is the behavior of good
words.

+++++++++++++++++++++++++++++++++++++++++++++++++

Kailash Aurangabadkar
Assignment # 4
A continuing quest for meanings in li(n|f)e
--------------------------------------------------------------------------------

The objective of the assignment is to continue to explore supervised
approaches to word sense disambiguation. In this assignment a sense
tagged text is created, and then a Naïve Bayesian classifier is
implemented to perform word sense disambiguation. This assignment
focuses on fixing the errors in assignment 3, so that we get optimal
values for accuracy. After fixing the classifier, we identify six
"good" words in the Open Mind data that we created as a class. Good
words are identified using the following criteria:

* Has a reasonable rate of agreement among the two taggers.
* Has a somewhat balanced distribution of senses. At the very least
  avoids the case where a single sense dominates the distribution.
* Has a reasonable number of examples.

Once the good words have been identified, we have to convert the
Open Mind data into the form of the line data so we can use our
Assignment 3 classifier. By executing the classifier on that data,
we address the word sense disambiguation problem in the Open Mind
data.
--------------------------------------------------------------------------------
Word Sense Disambiguation:

The task of disambiguation is to determine which of the senses of an
ambiguous word is invoked in a particular use of the word.
--------------------------------------------------------------------------------
Process:

In this assignment the Naïve Bayesian classification algorithm is
used to assign a score to every instance in the Test data for every
possible sense of the word. The central idea of this classifier is
to look at the words around the ambiguous word in a large context
window; each content word adds information for the disambiguation of
the target word. We first find the probability vector from the Train
data for each sense and feature combination. If we do not see a word
in that context, then we apply Witten-Bell smoothing to find the
probability of that event. Then we find the probability of each
sense given the content words in the Test data window by using the
Naïve Bayes assumption. The sense with the maximum probability for
an instance from the Test data is assigned as the sense of that
instance. The accuracy of the algorithm is then computed by
comparing the actual sense of the word in each line with the sense
we assigned. Algorithms then have to be developed to analyze the
Open Mind data and to convert it into the "line" data format.
--------------------------------------------------------------------------------
The assignment consists of two parts:

Part 1:

Part 1 consists of checking and fixing the four programs of
assignment 3, which are:

1. select.pl:- This program divides a sense tagged corpus of text
   into Training data and Test data.
2. feat.pl:- This program finds the features in the specified window
   around the target word. It also checks that the frequency of each
   feature is more than the specified cutoff.
3. convert.pl:- This program gives us the feature vector table,
   which shows whether or not the features obtained using feat.pl
   are present in the input file.
4. nb.pl:- This program assigns sense tags to untagged data from the
   test set. It does this by using the Naïve Bayesian algorithm,
   smoothing its values using Witten-Bell smoothing (sketched at the
   end of this report).
--------------------------------------------------------------------------------
Experiment Results of Part 1:

The accuracy values for each combination of window and frequency
cutoff are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.5350
0              2                   0.5350
0              5                   0.5350
--------------------------------------------------------------------------------
2              1                   0.7468
2              2                   0.7476
2              5                   0.7378
--------------------------------------------------------------------------------
10             1                   0.8212
10             2                   0.8166
10             5                   0.8148
--------------------------------------------------------------------------------
25             1                   0.8442
25             2                   0.8392
25             5                   0.8378
--------------------------------------------------------------------------------

We see in general that as the window size increases, the accuracy
value increases. This is to be expected: as the window size
increases, more content words around the ambiguous word are taken
into consideration, which gives us more information about the
particular sense occurring in that instance. We also see that the
accuracy value is not particularly affected by the value of the
frequency cutoff.
This is because the Naive Bayesian classifier is quite robust to
noise, so irrelevant data occurring around the words has no
remarkable effect on the sense tagging. This trend is generally
followed in the observations made from the experiments performed and
summarized in the table above. Thus we see that by making the window
size as large as possible we can get more and more accuracy in
assigning senses to ambiguous words.
-----------------------------------------------------------------------------------
Part 2:

Part 2 consists of analyzing the Open Mind data to find good words
and converting the data into the "line" data format used in
assignment 3. For this purpose 4 programs were created:

getsentence.pl:- This program splits the file "ids-to-sentences"
into separate files for each word type.

getsenses.pl:- This program splits the file
"cs8761-umd.full.detailed" into separate files for each word type.

getinfo.pl:- This program takes as arguments the sense tag files for
each word separated by getsenses.pl and finds the values of the
criteria that are checked to consider a word a good word.

getsenseword.pl:- This program splits the file containing instances
for a single word into files for each sense tag occurring with the
word, so that it is in the "line" data format.
------------------------------------------------------------------------------------
Experiment Results of Part 2:

A sample output from getinfo.pl for the sample word "difficulty" is
shown below:

tagdifficulty
Examples : 135
Tags : 4
TAG: difficulty%1:07:00:: , NUMBER: 52
TAG: difficulty%1:26:00:: , NUMBER: 57
TAG: difficulty%1:09:02:: , NUMBER: 43
TAG: difficulty%1:04:00:: , NUMBER: 59
Agreement : 0.719753086419753

From this output we come to know that the word difficulty has 135
instances and 4 senses, and that there was 72% agreement between the
persons who tagged the word on the Open Mind project website. The
output also gives us the distribution among senses.
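One plausible way to produce this kind of tally is sketched below.
This is not the actual getinfo.pl code: the input format (one
tagging per line, "instance_id<TAB>sense") and the agreement formula
(the fraction of taggings that match the majority sense of their
instance) are assumptions made for the sketch.

  #!/usr/bin/perl -w
  use strict;

  my (%by_instance, %sense_count);
  while (<>) {
      chomp;
      my ($id, $sense) = split /\t/;
      next unless defined $sense;
      push @{ $by_instance{$id} }, $sense;   # all taggings per instance
      $sense_count{$sense}++;                # overall sense distribution
  }

  # agreement: taggings matching their instance's majority sense,
  # divided by the total number of taggings (an assumed formula)
  my ($agree, $total) = (0, 0);
  for my $tags (values %by_instance) {
      my %c;
      $c{$_}++ for @$tags;
      my ($max) = sort { $b <=> $a } values %c;
      $agree += $max;
      $total += @$tags;
  }

  printf "Examples : %d\n", scalar keys %by_instance;
  printf "Tags : %d\n", scalar keys %sense_count;
  printf "TAG: %s , NUMBER: %d\n", $_, $sense_count{$_} for sort keys %sense_count;
  printf "Agreement : %s\n", $total ? $agree / $total : 0;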
I have chosen the following six words based on the output of
getinfo.pl:

Difficulty {Output of getinfo.pl
tagdifficulty
Examples : 135
Tags : 4
TAG: difficulty%1:07:00:: , NUMBER: 52
TAG: difficulty%1:26:00:: , NUMBER: 57
TAG: difficulty%1:09:02:: , NUMBER: 43
TAG: difficulty%1:04:00:: , NUMBER: 59
Agreement : 0.719753086419753
}

Art {Output of getinfo.pl
tagart
Examples : 110
Tags : 4
TAG: art%1:04:00:: , NUMBER: 35
TAG: art%1:06:00:: , NUMBER: 31
TAG: art%1:09:00:: , NUMBER: 25
TAG: art%1:10:00:: , NUMBER: 19
Agreement : 1
}

Captain {Output of getinfo.pl
tagcaptain
Examples : 180
Tags : 9
TAG: unlisted-sense , NUMBER: 1
TAG: captain%1:18:00:: , NUMBER: 7
TAG: captain%1:18:01:: , NUMBER: 11
TAG: captain%1:18:02:: , NUMBER: 31
TAG: unclear , NUMBER: 44
TAG: captain%1:18:03:: , NUMBER: 46
TAG: captain%1:18:04:: , NUMBER: 27
TAG: captain%1:18:05:: , NUMBER: 20
TAG: captain%1:18:06:: , NUMBER: 62
Agreement : 0.812962962962963
}

Length {Output of getinfo.pl
taglength
Examples : 149
Tags : 6
TAG: length%1:07:01:: , NUMBER: 47
TAG: length%1:07:02:: , NUMBER: 34
TAG: unclear , NUMBER: 3
TAG: length%1:07:03:: , NUMBER: 36
TAG: length%1:06:00:: , NUMBER: 21
TAG: length%1:07:00:: , NUMBER: 45
Agreement : 0.875838926174497
}

Distribution {Output of getinfo.pl
tagdistribution
Examples : 135
Tags : 6
TAG: unclear , NUMBER: 7
TAG: distribution%1:04:00:: , NUMBER: 56
TAG: unlisted-sense , NUMBER: 2
TAG: distribution%1:04:01:: , NUMBER: 76
TAG: distribution%1:07:00:: , NUMBER: 38
TAG: distribution%1:09:00:: , NUMBER: 26
Agreement : 0.769135802469136
}

Unit {Output of getinfo.pl
tagunit
Examples : 151
Tags : 9
TAG: unit%1:03:00:: , NUMBER: 11
TAG: unlisted-sense , NUMBER: 2
TAG: unit%1:14:00:: , NUMBER: 82
TAG: unit%1:23:00:: , NUMBER: 12
TAG: unit%1:24:00:: , NUMBER: 65
TAG: unit%1:06:01:: , NUMBER: 19
TAG: unit%1:17:00:: , NUMBER: 47
TAG: unit%1:09:00:: , NUMBER: 18
TAG: unclear , NUMBER: 9
Agreement : 0.724613686534216
}

The accuracy values for each combination of window and frequency
cutoff for "difficulty" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3095
0              2                   0.3095
0              5                   0.3095
--------------------------------------------------------------------------------
2              1                   0.4286
2              2                   0.4048
2              5                   0.3810
--------------------------------------------------------------------------------
10             1                   0.4286
10             2                   0.4524
10             5                   0.4048
--------------------------------------------------------------------------------
25             1                   0.4048
25             2                   0.3333
25             5                   0.3333
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "art" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3143
0              2                   0.3143
0              5                   0.3143
--------------------------------------------------------------------------------
2              1                   0.3429
2              2                   0.3143
2              5                   0.3143
--------------------------------------------------------------------------------
10             1                   0.3429
10             2                   0.4286
10             5                   0.4286
--------------------------------------------------------------------------------
25             1                   0.457
25             2                   0.3714
25             5                   0.3714
--------------------------------------------------------------------------------

The accuracy values for each combination of window and
frequency cutoff for "captain" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3148
0              2                   0.3148
0              5                   0.3148
--------------------------------------------------------------------------------
2              1                   0.3148
2              2                   0.3333
2              5                   0.3148
--------------------------------------------------------------------------------
10             1                   0.4074
10             2                   0.3148
10             5                   0.3148
--------------------------------------------------------------------------------
25             1                   0.3889
25             2                   0.4259
25             5                   0.3148
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "length" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.2766
0              2                   0.2766
0              5                   0.2766
--------------------------------------------------------------------------------
2              1                   0.3404
2              2                   0.3191
2              5                   0.2979
--------------------------------------------------------------------------------
10             1                   0.3404
10             2                   0.3404
10             5                   0.2125
--------------------------------------------------------------------------------
25             1                   0.2125
25             2                   0.2766
25             5                   0.2766
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "distribution" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3898
0              2                   0.3898
0              5                   0.3898
--------------------------------------------------------------------------------
2              1                   0.5000
2              2                   0.5000
2              5                   0.5000
--------------------------------------------------------------------------------
10             1                   0.3810
10             2                   0.4524
10             5                   0.3571
--------------------------------------------------------------------------------
25             1                   0.4524
25             2                   0.3333
25             5                   0.5238
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "unit" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3458
0              2                   0.3458
0              5                   0.3458
--------------------------------------------------------------------------------
2              1                   0.4792
2              2                   0.4375
2              5                   0.4375
--------------------------------------------------------------------------------
10             1                   0.3125
10             2                   0.3333
10             5                   0.3333
--------------------------------------------------------------------------------
25             1                   0.4167
25             2                   0.4167
25             5                   0.3333
--------------------------------------------------------------------------------

These are very low accuracy values. This is because there were only
a few examples (on the order of 100) compared to the line data (on
the order of 4000). With fewer instances there is less data to learn
from.
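Referring back to the Witten-Bell smoothing mentioned in the Process
section and under nb.pl, the smoothed estimate of P(feature|sense)
can be sketched as follows. This is a minimal sketch under assumed
data structures (a hash of observed counts, a total token count, and
an overall vocabulary size), not the actual assignment code.

  #!/usr/bin/perl -w
  use strict;

  # Witten-Bell smoothed P(feature|sense).
  #   $count - hashref: frequency of each feature seen with this sense
  #   $n     - total feature tokens observed with this sense
  #   $vocab - number of feature types overall
  sub wb_prob {
      my ($feature, $count, $n, $vocab) = @_;
      my $t = scalar keys %$count;   # distinct features seen with the sense
      my $z = $vocab - $t;           # feature types never seen with it
      if (exists $count->{$feature}) {
          return $count->{$feature} / ($n + $t);   # seen: discounted count
      }
      return $z ? $t / ($z * ($n + $t)) : 0;       # unseen: reserved mass
  }

  # e.g. with counts {the => 3, cut => 1}, n = 4, vocab = 10:
  #   wb_prob('the',  {the=>3, cut=>1}, 4, 10) == 3/6  = 0.5
  #   wb_prob('rope', {the=>3, cut=>1}, 4, 10) == 2/48 ~ 0.0417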
+++++++++++++++++++++++++++++++++++++++++++++++++

cs8761 Natural Language Processing
Assignment 3
Archana Bellamkonda
November 1, 2002

Problem Description (for part II - Naive Bayesian Classifier Implementation):
-------------------
The objective is to assign a sense to a given target word in a
sentence using a Naive Bayesian classifier. We start off with sets
of previously collected data, each containing instances of a
particular sense. In this assignment, we do the sense tagging in
four steps:

1) select.pl :- This program collects all instances from the given
   data, randomizes the instances after adding information about the
   sense from which they were extracted, and divides these instances
   into two sets of data, TEST and TRAIN, depending on the
   percentage entered by the user.

2) feat.pl :- This program identifies all word types that occur
   within "w" positions to the left or right of the target word and
   that occur more than "f" times in the TRAIN data set. It does not
   include the target word as a feature.

3) convert.pl :- This program converts the input file to a feature
   vector representation, where the features output by feat.pl are
   read from standard input. Each instance in the file is converted
   into a series of binary values that indicate whether or not each
   type in the feature list output by feat.pl has occurred within
   the specified window around the target word in the given
   instance.
   NOTE:- It also includes the number of unobserved features in the
   specified window around the target word for every instance. This
   is the last number in the feature vector.

4) nb.pl :- This program will learn a Naive Bayesian classifier and
   use that classifier to assign sense tags to the test data; the
   senses are printed along with the instance ids, the actual sense,
   and the probability of the assigned sense. (A sketch of this
   decision rule appears after the conclusions below.)

Experiment:
----------

                     Frequency cutoff
window size |      1        2        5
------------|------------------------------
     0      |   0.5454   0.5289   0.5248
     2      |   0.8512   0.8471   0.8471
    10      |   0.8926   0.8595   0.8223
    25      |   0.8801   0.8800   0.7933

Experiments were done with the window sizes and frequency cutoffs
shown above. The results are as shown for all twelve combinations.
(Here the experiment was done with the phone2 and division2 files.)

Observations:
------------

-->Expected:
--------
As the window size increases, the number of features that we observe
increases, and hence we will learn more about the context of a given
word, so we would expect higher accuracy as the window size
increases (taking the frequency cutoff to be 1). But the meaning of
a word depends on the surrounding context only; for example, the
meaning of a word in a sentence will generally not depend on the
meaning of a word in other sentences. So if we go on increasing the
window size, starting from zero, we will observe a significant
increase in accuracy up to a certain level, and then, when the
window size increases beyond the required context, accuracy will not
increase significantly.

Observed:
--------
As shown in the table above, we observed what we expected. Consider
the column for frequency cutoff 1. As the window size increased,
accuracy increased significantly, from 0.5454 to 0.8512. From then
onwards, accuracy did not increase significantly.
-->Expected:
--------
According to Zipf's law, most of the features are not repeated in a
text. So as the frequency cutoff increases for a particular window
size, fewer features are observed and hence we cannot capture the
context properly. Thus, accuracy decreases as the frequency cutoff
increases for a given window size.

Observed:
--------
As shown in the table above, we again observed what we expected.
Consider any row for a particular window size. The accuracy values
decrease as the frequency cutoff increases.

-->Expected:
--------
When we consider the case where the frequency cutoffs are
increasing, we also estimate that more features will be observed as
the window size increases, and thus we could expect accuracy to be
higher.

Observed:
--------
Consider the columns for frequency cutoffs 2 and 5. We observed what
we expected.

NOTE:
----
In some cases above, I used a precision of 16 digits after the
decimal point, as the probability values are very small and if
rounded to 4 digits after the decimal point they are all zeros.

OUTPUTS FOR TESTCASES:
---------------------

Test Case 1:
-----------
cord2
w7_039:13446: My line was cut.
w7_039:13447: My line was cut.
w7_039:13448: My line was cut.
w7_039:13449: My line was cut.
w7_039:13450: My line was cut.

text2
w7_039:12446: The line of text is very unclear.
w7_039:12447: The line of text is very unclear.
w7_039:12448: The line of text is very unclear.
w7_039:12449: The line of text is very unclear.
w7_039:12450: The line of text is very unclear.

Output:
w7_039:13448: cord 0.0031 cord
w7_039:13446: cord 0.0031 cord
w7_039:13450: cord 0.0031 cord
1

Test Case 2:
------------
cord2
w7_039:13446: A line B C
w7_039:13447: D line E F
w7_039:13448: G line H I
w7_039:13449: J line K L
w7_039:13450: M line N O

text2
w7_039:12446: P Q line
w7_039:12447: R S line
w7_039:12448: T U line
w7_039:12449: V W line
w7_039:12450: X Y line

Output:
w7_039:12447: text 0.1429 text
w7_039:13446: text 0.0212 cord
w7_039:13447: text 0.0212 cord
0.333333333333333

Test Case 3:
------------
cord2
w7_039:13446: A line B C
w7_039:13447: D line E F
w7_039:13448: G line H I
w7_039:13449: J line K L
w7_039:13450: M line N O

text2
w7_039:12446: P Q line
w7_039:12447: R S line
w7_039:12448: T U line
w7_039:12449: V W line
w7_039:12450: X Y line

Output:
w7_039:13448: text 0.0212 cord
w7_039:12446: text 0.1429 text
w7_039:13446: text 0.0212 cord
0.333333333333333

CONCLUSIONS
----------

Noise:
-----
We should select the window size to be optimal. We should observe
where we start getting noise, i.e. the cutoff where we start
observing unwanted features in our context, and limit our window
size to be lower than that cutoff. Accuracy increases as the window
size increases, and as noise enters the window there will be a drop
in accuracy, though not a significant one. We have to observe where
we get that drop.

Optimal Combination:
-------------------
The optimal combination will be the greatest window size below the
cutoff where we observe noise, together with a frequency cutoff of
"1", as we would observe more features than with higher frequency
cutoffs and hence would learn the context of a sense in a better
way.

Optimal Combination In Our Experiment:
-------------------------------------
As seen from the table above, the optimal combination is a window
size of "10" and a frequency cutoff of "1". We got the highest
accuracy at that point, 0.8926, as shown.
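For reference, the decision rule that nb.pl (step 4 above) applies
to each test instance can be sketched as shown below. The data
layout (hashes of prior and smoothed conditional probabilities) is
an assumption made for the sketch, not the actual nb.pl internals.

  #!/usr/bin/perl -w
  use strict;

  # Pick the sense maximizing log P(s) + sum of log P(f|s) over the
  # features present in the instance. Smoothing must guarantee that
  # every probability is non-zero before taking logs.
  sub classify {
      my ($features, $prior, $cond) = @_;
      # $features - arrayref of feature names present in the instance
      # $prior    - hashref: sense => P(sense), from TRAIN
      # $cond     - hashref: sense => { feature => smoothed P(f|s) }
      my ($best, $best_score);
      for my $sense (keys %$prior) {
          my $score = log $prior->{$sense};
          $score += log $cond->{$sense}{$_} for @$features;
          ($best, $best_score) = ($sense, $score)
              if !defined $best_score || $score > $best_score;
      }
      return $best;
  }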
+++++++++++++++++++++++++++++++++++++++++++++++++

#----------------------------------#
#         Deodatta Bhoite          #
#             CS8761               #
#         Assignment no 4          #
#         Date: 11-11-02           #
#----------------------------------#

Corrected Naive Bayesian output
-------------------------------
The output of the experiments of the naive bayesian classifier for
the line data (all senses) is as follows:

---------------
W    F    Accuracy
---------------
0    1    0.5349
0    2    0.5349
0    5    0.5349
- - - - - - - -
2    1    0.7317
2    2    0.7197
2    5    0.6908
- - - - - - - -
10   1    0.8286
10   2    0.8201
10   5    0.7939
- - - - - - - -
25   1    0.8309
25   2    0.8137
25   5    0.8060
---------------

As we can see, the maximum accuracy is obtained when the window size
is 25 and the frequency cutoff is 1. The accuracy increases as the
window size increases, and it also increases as the frequency cutoff
decreases. This trend is probably observed because of the increase
in the number of features as the window size increases and the
frequency cutoff decreases. The accuracy for window size 0 is
constant for any frequency cutoff; here the classifier assigns the
sense that occurs the maximum number of times to all the instances.

Searching for `Good' words
---------------------------

How to run the scripts?
- - - - - - - - - - - -
For finding the `good' words in the Open Mind data, use the scripts
stats.pl and best.pl. They can be run as follows:

% stats.pl /home/cs/tpederse/CS8761/Open-Mind/OMWE-tagging > summary
% best.pl

Working of the scripts
- - - - - - - - - - - -
The stats.pl script finds various statistics about the data, viz.
the number of tags per word, the number of examples per word, the
number of tags per example, the number of senses per word, the
agreement ratio between the users who tagged the word, and the
normalized variance of the distribution of senses. All this
information is tabulated and written to "table.txt". It also assigns
a particular sense to each instance id and stores this in the file
"assign.txt".

The best.pl script reads the tabular information in table.txt, sorts
it according to the various columns, and prints the result in
"sort.txt". For example, it sorts the variance in ascending order,
whereas the agreement ratio is sorted in descending order, etc. It
then assigns scores to the words based on their ranks in the sorted
list for each column and prints the scores out in ascending order.
The least score signifies a good word, since it is the word that is
top ranked across all sorted columns. Thus, the top n words can be
identified as the top n words in the file "score.txt". Of course, if
we wanted to give more importance to a particular feature (like
agreement ratio or variance) we would take a weighted rank, but I
have given equal importance to all the features, so no weights are
used. (A sketch of this rank-sum scoring appears after the word
lists below.)

The `Good' words
- - - - - - - - -
The top 6 words in the CS-8761 project and their scores are as
follows:

captain    57
unit       58
chapter    66
volume     83
structure  84
depth      88

The top 6 words in the Open-mind data and their scores are as
follows:

circuit    195
feeling    221
restraint  224
experience 225
grip       232
interest   242

We note that the scores are not normalized; hence, we cannot do a
cross comparison between the two outputs.
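The rank-sum scoring described under "Working of the scripts" can be
sketched as follows. This is an illustration of the idea only; the
column set and hash layout are assumptions, and the real best.pl
reads its input from table.txt.

  #!/usr/bin/perl -w
  use strict;

  # %stats: word => { agreement => ..., examples => ..., variance => ... }
  # Rank the words in each column (higher agreement and more examples
  # are better; lower variance is better), sum the ranks per word, and
  # return the totals: the smallest total marks the best word.
  sub score_words {
      my %stats = @_;
      my %score;
      my @by_agree = sort { $stats{$b}{agreement} <=> $stats{$a}{agreement} } keys %stats;
      my @by_ex    = sort { $stats{$b}{examples}  <=> $stats{$a}{examples}  } keys %stats;
      my @by_var   = sort { $stats{$a}{variance}  <=> $stats{$b}{variance}  } keys %stats;
      for my $ranked (\@by_agree, \@by_ex, \@by_var) {
          $score{ $ranked->[$_] } += $_ + 1 for 0 .. $#$ranked;
      }
      return %score;
  }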
So we will try to analyze further by adding details of the features to the words:

Word        Ex    T/E   AR    Var
captain     180   2.03  0.73  0.0103  *
unit        151   2.63  0.70  0.0148
chapter     150   2.50  0.92  0.0531  *
volume      119   2.78  0.77  0.0171  *
structure   150   2.08  0.79  0.0517
depth       150   2.13  0.48  0.0137
circuit     405   2.24  0.67  0.0089
feeling     198   2.39  0.62  0.0090
restraint   401   2.31  0.57  0.0045
experience  120   3.56  0.93  0.0004  *
grip        410   2.45  0.76  0.0309  *
interest   1703   2.07  0.70  0.0099  *

(Ex = examples/word; T/E = tags/example; AR = agreement ratio; Var = variance of sense distribution)

Note that I have considered T/E as a feature only because I believe having 2 users in agreement is stronger evidence for the word being of that sense than having only one user tag the word with 100% agreement.

We reject depth, circuit, feeling, and restraint for low agreement ratio. We do not select 'structure' because of its high variance; however, we keep 'chapter', despite its high variance, because of its high agreement ratio. We realize the drawbacks of assigning equal weights to all criteria by observing that the top 3 words in the complete Open Mind data are not good. Instead, we manually picked 3 good words from each set: captain, chapter, and volume from UMD, and experience, grip, and interest from the total Open Mind data.

Generating the data in `line' format
------------------------------------
Note that you have to run stats.pl before you run this script, because this script uses the assignment output file generated by the stats script.

The script "datagen.pl" generates the data in `line' format from the Open Mind data. It is run as follows:

% datagen.pl volume assign.txt /home/cs/tpederse/CS8761/Open-Mind/ids-to-sentences

where the first argument is the word, the second is the assignment file created by stats.pl, and the third is the ids-to-sentences file, which contains the mapping from ids to sentences. The output is generated in the current working directory, and the instances are divided into files according to the senses assigned in the assignment file.

Naive bayesian classifier on `Good' words
-----------------------------------------
The output of the Naive Bayesian classifier for the `good' words we selected is as follows:

Captain
---------------
 W   F   Accuracy
---------------
 0   1   0.2653
 0   2   0.2653
 0   5   0.2653
- - - - - - - -
 2   1   0.3265
 2   2   0.2449
 2   5   0.2449
- - - - - - - -
 10  1   0.3265
 10  2   0.2653
 10  5   0.2449
- - - - - - - -
 25  1   0.3469
 25  2   0.2857
 25  5   0.2857
---------------

The classifier performs as expected for this word. The maximum accuracy is attained when the window size is maximum and the frequency cutoff is low. However, the difference in accuracies between window sizes 0 and 25 should have been higher.
Chapter
---------------
 W   F   Accuracy
---------------
 0   1   0.6000
 0   2   0.6000
 0   5   0.6000
- - - - - - - -
 2   1   0.6667
 2   2   0.6667
 2   5   0.6000
- - - - - - - -
 10  1   0.6667
 10  2   0.6667
 10  5   0.5556
- - - - - - - -
 25  1   0.6222
 25  2   0.6000
 25  5   0.5778
---------------

Volume
---------------
 W   F   Accuracy
---------------
 0   1   0.3429
 0   2   0.3429
 0   5   0.3429
- - - - - - - -
 2   1   0.2571
 2   2   0.2857
 2   5   0.3143
- - - - - - - -
 10  1   0.2857
 10  2   0.2857
 10  5   0.2857
- - - - - - - -
 25  1   0.2857
 25  2   0.2857
 25  5   0.3143
---------------

Experience
---------------
 W   F   Accuracy
---------------
 0   1   0.4762
 0   2   0.4762
 0   5   0.4762
- - - - - - - -
 2   1   0.3810
 2   2   0.3333
 2   5   0.4762
- - - - - - - -
 10  1   0.2381
 10  2   0.3333
 10  5   0.3333
- - - - - - - -
 25  1   0.2857
 25  2   0.3333
 25  5   0.2857
---------------

Grip
---------------
 W   F   Accuracy
---------------
 0   1   0.4865
 0   2   0.4865
 0   5   0.4865
- - - - - - - -
 2   1   0.1261
 2   2   0.1892
 2   5   0.2523
- - - - - - - -
 10  1   0.4685
 10  2   0.4414
 10  5   0.4324
- - - - - - - -
 25  1   0.4414
 25  2   0.4414
 25  5   0.4685
---------------

Interest
---------------
 W   F   Accuracy
---------------
 0   1   0.2834
 0   2   0.2834
 0   5   0.2834
- - - - - - - -
 2   1   0.5749
 2   2   0.5441
 2   5   0.5298
- - - - - - - -
 10  1   0.5996
 10  2   0.5873
 10  5   0.5667
- - - - - - - -
 25  1   0.5996
 25  2   0.5832
 25  5   0.5667
---------------

The classifier performs as expected for the word `interest'. The features, however, seem to be more local than topical; hence there is no rise from window size 10 to 25.

The classifier does not perform well in most cases for the above words. In fact, the accuracy often drops below the most-frequent-sense (window size 0) accuracy. However, I am unable to explain why this happens for some words.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Naive Bayesian Classifier
Bridget Thomson McInnes
11 November, 2002
CS8761

----------------------------------------------------------------------------
EXPERIMENTS:
----------------------------------------------------------------------------

| Window Size | Frequency Cutoff | Accuracy |
|-------------|------------------|----------|
|      0      |        1         |  0.5386  |
|      0      |        2         |  0.5346  |
|      0      |        5         |  0.5471  |
|      2      |        1         |  0.7299  |
|      2      |        2         |  0.7122  |
|      2      |        5         |  0.6747  |
|     10      |        1         |  0.7679  |
|     10      |        2         |  0.8142  |
|     10      |        5         |  0.7896  |
|     25      |        1         |  0.7920  |
|     25      |        2         |  0.8345  |
|     25      |        5         |  0.8257  |
|--------------------------------------------|

ANALYSIS:
----------------------------------------------------------------------------
The accuracy for a window size of zero is approximately the same for each of the runs, at around 50%.
This is due to the fact that there are no features for the classifier to train on. The classifier picks the most frequent sense of the instances in the training data and applies this sense to each instance in the test data. Given this, it might be thought that the accuracy for a window size of zero should be the same no matter what the frequency cutoff is. This is not the case, because the training and test instances are randomly chosen each time the program is run; therefore, the number of instances for each tag in the test and training files varies from run to run.

The accuracy for a window size of two is definitely higher than the accuracy for a window size of zero. This is as expected, because with a window size of two the classifier is not picking the most frequent sense for every instance; it is using the features from the training data to determine the sense of the instances in the test data. The run made with a frequency cutoff of five is lower than the runs with frequency cutoffs of one and two. This is because with a frequency cutoff of five, relevant features are not being included: the chance of a relevant feature occurring five times is smaller than of it occurring once or twice.

The accuracy for a window size of ten is greater than the accuracy for window sizes of zero and two. This is expected because there is a greater number of features to identify the unique tag. Similarly, with a window size of 25 the accuracy is greater than with a window size of ten. The frequency cutoffs of two and five for a window size of 25 did not change the accuracy very much, but with a cutoff of one the accuracy decreased. This is due to the fact that the relevant features are not as unique to the tag as with a greater frequency cutoff. The decrease is not significant, though, because a Naive Bayesian Classifier is noise resistant, so the low-frequency words that are common to all the senses should factor out.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Report for Assignment-4
-----------------------
Suchitra Goopy
11/11/2002
Natural Language Processing

Introduction:
-------------
The main task to be performed in "word sense disambiguation" is to find in which "sense" a particular word is used. Many words have one form but different meanings. For example, the word "line" can either mean a "line" at the ticket counter or a telephone "line".

"The task of disambiguation is to determine which of the senses of an ambiguous word is invoked in a particular use of the word."
    -- "Foundations of Statistical Natural Language Processing", Christopher D. Manning and Hinrich Schutze

How is the task performed?
-------------------------
Consider the word "line" and its different uses:

I stood in line at the bank.
The telephone line is long.

We cannot know the sense in which these two words are used unless we look at the surrounding words in the sentence. In this experiment we make use of the features, i.e. the words surrounding the target word, to help us in our task of disambiguation.

Smoothing Techniques Used:
--------------------------
Sometimes we see a lot of zero values, or unobserved events, in the data.
If these values are used just as they are in the experiment, they can lead to flawed results. So we substitute some probability value for each unobserved event and also modify the values of the observed events, so that a good probability distribution is obtained. A distribution without zeros is much smoother than one with zeros. I have used the Witten-Bell smoothing technique in my experiments, where an observed event is assigned the probability frequency/(types+tokens) and each unobserved event is assigned types/(z*(types+tokens)), where z is the number of unobserved types. (A small sketch of these formulas appears after the word-selection discussion below.)

Results of the "Line" data:
---------------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.5133476
     0            2       0.5133476
     0            5       0.5133476
     2            1       0.7289126
     2            2       0.7165428
     2            5       0.6967812
    10            1       0.7635899
    10            2       0.7635899
    10            5       0.7243576
    25            1       0.7921347
    25            2       0.7921347
    25            5       0.7662899

Analysis:
---------
When the window size is zero, there are no features to help us decide the sense of the word. So in this case we assign the "majority classifier" result, i.e. the most frequent sense, to the instances in the test data.

Care should be taken to see that the instances with different senses are evenly distributed between the training and test data sets. I did run into problems in this regard. As long as the instances were evenly distributed, the Naive Bayesian Classifier did a very good job of learning from the training data and then applying the results to the test data. Sometimes, when the data was not evenly distributed, I had to execute select.pl again and ensure that I had an even distribution.

As the window size increases, more features are considered. Also, with larger window sizes we can be more confident that the features that are most important for identifying the sense will be included. If the frequency cutoff increases, some features are eliminated because they do not occur more often than the specified cutoff. Hence a large window size and a small frequency cutoff will work best for this classifier. The classifier does not perform as well as expected, but it has definitely improved.

Data Conversion:
----------------
Data obtained from the Open Mind project had to be converted to a form similar to the "line" data used in this assignment.

Words used for performing more experiments:
-------------------------------------------
attempt, act, captain, author, distribution, college

Reasons for choosing these words
---------------------------------
The words had to be chosen in such a way that:
1) There was a high degree of agreement among the users
2) There was an even distribution among the various senses
3) There was a reasonable number of examples

This task was much more difficult than I had expected. It was very difficult to find an ideal word where all three conditions were satisfied. Typically, if there was a high rate of agreement between two users, then there would probably be a single sense that dominated the distribution. It seemed that as the number of senses for a word increased, there was more disagreement among users.

When I picked the words, I gave primary importance to an even distribution of senses, because there is no use in applying the classifier to words where a single sense is dominant. Then I gave importance to the rate of agreement among users, because there had to be decent agreement among users to carry out this experiment successfully.
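Returning to the Witten-Bell smoothing described above, here is a minimal Perl sketch of the two formulas, using toy counts. It illustrates the estimates only and is not the submitted classifier code; the counts and the value of z are made up for the example:

# witten_bell_sketch.pl - minimal illustration of the Witten-Bell
# estimates described above (toy counts, not the submitted code).
use strict;
use warnings;

my %count = (the => 5, cord => 2, text => 3);   # toy observed counts
my $tokens = 0;
$tokens += $_ for values %count;                # tokens = total count
my $types  = scalar keys %count;                # types  = observed types
my $z      = 4;                                 # z = assumed number of unobserved types

# Observed event: frequency / (types + tokens)
for my $w (sort keys %count) {
    printf "P(%s) = %.4f\n", $w, $count{$w} / ($types + $tokens);
}
# Each unobserved event: types / (z * (types + tokens))
printf "P(unseen) = %.4f\n", $types / ($z * ($types + $tokens));

With these toy counts the observed mass is 10/13 and the z unseen events share the remaining 3/13, so the smoothed values still form a proper distribution, which is the point of the redistribution.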
Despite trying to choose words carefully, I did have problems getting even distributions, and I also realised that in some cases one sense did dominate the distribution.

Results for attempt:
--------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3187353
     0            2       0.3187353
     0            5       0.3187353
     2            1       0.3464726
     2            2       0.3464726
     2            5       0.3285621
    10            1       0.4562341
    10            2       0.4562341
    10            5       0.4098266
    25            1       0.5125634
    25            2       0.5125634
    25            5       0.4663571

Analysis:
--------
The results obtained for "attempt" are not as good as those obtained for the "line" data. I think the reason is that with a word like "attempt" there is no single obvious sense for an instance, and the sense assigned depends on whether 2 users agreed on it. Even if two users did agree, there is still a chance that the agreed sense is not the correct one, and we run into more problems when users disagree on word senses. The classifier may get confused, because if the senses are not correct, it will not be able to learn them effectively.

Results for act:
----------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2964731
     0            2       0.2964731
     0            5       0.2964731
     2            1       0.3287548
     2            2       0.3287548
     2            5       0.2987462
    10            1       0.4194739
    10            2       0.4194739
    10            5       0.3783542
    25            1       0.4987632
    25            2       0.4984939
    25            5       0.4564383

Analysis:
---------
I think that "act" has somewhat the same problems as "attempt". If users agree on a sense, then encounter a word with a similar sense elsewhere and assign the wrong sense to it, the classifier ends up trying to learn the same features for different senses and is not able to pick the correct sense.

Results for captain:
-------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3685735
     0            2       0.3685735
     0            5       0.3685735
     2            1       0.3884673
     2            2       0.3884673
     2            5       0.3712632
    10            1       0.4582536
    10            2       0.4582536
    10            5       0.4267484
    25            1       0.5384638
    25            2       0.5384638
    25            5       0.5173849

Analysis:
---------
This word performed well compared to the other words. It seemed to have a fairly decent distribution as well as a decent rate of agreement between the two users. It reached values much above the ones for "act" and "attempt".

Results for author:
-------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2846586
     0            2       0.2846586
     0            5       0.2846586
     2            1       0.3278685
     2            2       0.3278685
     2            5       0.2967598
    10            1       0.3528467
    10            2       0.3528467
    10            5       0.3175985
    25            1       0.3956383
    25            2       0.3956383
    25            5       0.3759573

Analysis:
---------
I think the problem with this word was that initially I thought the agreement among the users was high. But when I obtained these accuracy values, I went back and took another look, and it turned out that the users did not have a very high rate of agreement; maybe this was the reason for the poor performance.

Results for distribution:
--------------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3746372
     0            2       0.3746372
     0            5       0.3746372
     2            1       0.4028425
     2            2       0.4028425
     2            5       0.3935481
    10            1       0.4485730
    10            2       0.4485730
    10            5       0.4362537
    25            1       0.5153764
    25            2       0.5153764
    25            5       0.4925475

Analysis:
---------
This word too performed pretty well. It reached high values of accuracy, though not when compared to the "line" data.
Results for college:
--------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2754947
     0            2       0.2754947
     0            5       0.2754947
     2            1       0.2956481
     2            2       0.2956481
     2            5       0.2838104
    10            1       0.3327492
    10            2       0.3327492
    10            5       0.3274722
    25            1       0.3956673
    25            2       0.3956673
    25            5       0.3592640

Analysis:
----------
"College" did not perform too well; I think it performed very poorly compared to the rest of the words.

Conclusion:
-----------
The classifier performed very well for the "line" data. But I felt that smoothing should have been applied to the "Test" file as well, because if a word does not appear in a particular sentence, that does not mean it is not a feature of the word. I do not know whether the classifier would have performed better if this had been done. I did have problems trying to get even distributions and had to execute select.pl again and check for good distributions, but overall I think this classifier performs better.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Paul Gordon (1913768)
Assignment 4 -- A continuing quest for meanings in li(f|n)e
CS8761
11/11/02

Introduction

The Naive Bayesian Classifier is a popular method for supervised word sense disambiguation. Its popularity results from its straightforward implementation and its resistance to noise. Indeed, in recent experiments (assignment 3), results were as high as 85% against a baseline of 54%, with 4000 instances. In the following experiments, the Naive Bayesian Classifier will be tested under less ideal conditions: baselines will be below 50%, and the number of instances in most cases will be between 100 and 150. The following experiments will determine how the classifier performs under these circumstances.

Methods

The four files select.pl, feat.pl, convert.pl, and nb.pl operate the same as in assignment 3. Collectively, these files implement the Naive Bayesian Classifier. In addition, there are three new files.

cull.pl was created to find "good" words to use in experiments with the classifier. The criteria for "good" are:
1) Reasonable agreement among taggers.
2) A reasonably balanced distribution.
3) A reasonable number of instances.

The usage is as follows: cat tag-file | cull.pl > tag.out. The output file contains the number of non-duplicated instances associated with each sense. The sense chosen for duplicate instances was the first one encountered. This information was used to make an informed, but still subjective, decision about how well each word adhered to points 2 and 3. The output file also contains the number of duplicate instances, the number of duplicates that disagree in sense, and the fraction of disagreeing duplicates to total duplicates. This information was used to determine "good" with respect to point 1.

prep.pl creates the sense files. It retrieves each non-duplicate sentence associated with the chosen word from the ids-to-sentences file and assigns it to a file named after its sense tag. The usage is: prep.pl file word, where file is a number. If file is 1, the tag file used is /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed. If file is 2, the tag file used is /home/cs/tpederse/CS8761/Open-Mind/OMWE-tagging.
Word is the word for which sentences are collected. For example, ./prep.pl 1 art was the command used to create the data files for the first experiment in the results section.

The last new file is a bash shell script, tagscript.bash. This file creates the TEST and TRAIN files, then runs the rest of the programs under each of the window size/frequency cutoff conditions, and reports the accuracy of each test.

Hypothesis

As discussed by Professor Pedersen in lecture, the Naive Bayesian Classifier is resistant to noise, and as a result should show an increase in accuracy as the window size increases, and a decrease in accuracy as the frequency cutoff increases.

Results

UMD-tagged word results

art
distribution: 35 31 25 19
discrepancy: 0

window size   frequency cutoff   Accuracy
     0               1            .2727
     0               2            .2727
     0               5            .2727
     2               1            .4242
     2               2            .4545
     2               5            .4545
    10               1            .4242
    10               2            .4242
    10               5            .3939
    25               1            .6061
    25               2            .5454
    25               5            .4848

neighborhood
distribution: 56 59 2
discrepancy: .3699

window size   frequency cutoff   Accuracy
     0               1            .5278
     0               2            .5278
     0               5            .5278
     2               1            .5833
     2               2            .6111
     2               5            .5000
    10               1            .6389
    10               2            .5556
    10               5            .5833
    25               1            .5000
    25               2            .5278
    25               5            .5556

circumstance
distribution: 10 62 62
discrepancy: .5105

window size   frequency cutoff   Accuracy
     0               1            .4634
     0               2            .4634
     0               5            .4634
     2               1            .6585
     2               2            .6341
     2               5            .5854
    10               1            .6585
    10               2            .6341
    10               5            .6341
    25               1            .5366
    25               2            .4878
    25               5            .5366

Full tag-file word results

circuit
distribution: 50 12 62 62 92 116
discrepancy: .4543

window size   frequency cutoff   Accuracy
     0               1            .2941
     0               2            .2941
     0               5            .2941
     2               1            .4286
     2               2            .4538
     2               5            .4034
    10               1            .4118
    10               2            .3782
    10               5            .3866
    25               1            .3950
    25               2            .4454
    25               5            .3445

audience
distribution: 68 66 1
discrepancy: .3588

window size   frequency cutoff   Accuracy
     0               1            .4634
     0               2            .4634
     0               5            .4634
     2               1            .5854
     2               2            .5854
     2               5            .6584
    10               1            .5366
    10               2            .5854
    10               5            .6098
    25               1            .5854
    25               2            .5610
    25               5            .6098

discussion
distribution: 50 54
discrepancy: .5455

window size   frequency cutoff   Accuracy
     0               1            .3438
     0               2            .3438
     0               5            .3438
     2               1            .4688
     2               2            .5938
     2               5            .5000
    10               1            .5312
    10               2            .5312
    10               5            .5000
    25               1            .5000
    25               2            .6250
    25               5            .6250

Conclusions

As can be seen in the previous section, the results show deviations from the expected. Specifically, the accuracy sometimes decreases with increasing window size, and sometimes the accuracy increases with increasing frequency cutoff.

Possible reasons:

1) Sparse data exaggerates the smoothing. When the number of instances is small, smoothing tends to give a disproportionately large probability to unseen events.

2) Because of the small number of instances, TEST and TRAIN are not always representative of the total distribution. In the previous assignment, with 4000 instances, it was fairly certain that if the division of instances was random, TEST and TRAIN would have distributions close to the original. However, with the Open Mind data, most sense words have between 100 and 150 non-duplicate instances. If these instances are divided into TEST and TRAIN files, with 70% going to TRAIN and 30% going to TEST, then TEST could have as few as 30 instances. If there are four different, equally distributed senses for a word, that puts the number of instances per sense at 7 or 8. This is a small enough number for random variations to change the distribution significantly.

3) The small number of instances also causes small changes in the number correct to have a relatively large effect on accuracy. For 30 TEST sentences, a change of one in the number correct changes the accuracy by 3 1/3%.
4) The large amount of disagreement in sense tags may have contributed to some of the unexpected results. Discrepancy percentages were between 35 and 55 percent, which means that for nearly every other duplicate sentence there was a disagreement between taggers. Art, which was done by a single tagger, still shows some unexpected numbers, but is generally in keeping with the hypothesized results.

There does not seem to be a discernible trend in the results. The two words that did the best, audience and discussion, both contained two nearly equally distributed senses (audience has a third sense with only one instance), but neighborhood also had a similar distribution, and it did not do as well. The number of instances and the level of discrepancy did not seem to matter much, but the range of both these values was limited. The results of these experiments are inconclusive beyond the observation that in general the results were much lower than in experiment 3. Rerunning these experiments with a larger number of instances would help to narrow the source of the problem to either the first three points above, or to point 4, which is independent of the number of instances.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Prashant Jain
CS8761

Assignment #4: A continuation of finding meaning in Li(f|n)e

Introduction
------------
The objective of this assignment was to fix the code written in assignment 3, which was to "explore supervised approaches to word sense disambiguation". We had been given the line data, which contains six files. There were a number of instances per file, but only one instance per line. What we had to do was implement a Naive Bayesian Classifier to perform word sense disambiguation: given an instance of 'line' in a file, find the best sense to assign to it. After fixing that part, we had to use the Open Mind data given to us, find 3 (6) interesting words in it, and run our classifier on those words.

Procedure
---------
We had to create four files that implement the Naive Bayesian Classifier. These were:

select.pl
---------
This file divides the given data into TEST and TRAIN data after sense-tagging it. It is provided with a percentage argument as well as the target.config file, which contains the regular expressions used to extract lines and instance ids.

feat.pl
-------
This file uses the TRAIN data along with the window size and frequency cutoff (provided by the user at the command line) to get the feature words (words of interest), which are put in the FEAT file.

convert.pl
----------
This file converts both the TRAIN and TEST data into feature vector representations using the features provided by feat.pl. This is basically a binary representation of instances/senses and features.

nb.pl
-----
This is the file in which we implement the Naive Bayesian classifier. We use the files obtained after converting the TRAIN data to assign sense values to the TEST data, and check how accurately the correct senses were assigned. (A small sketch of the underlying decision rule follows.)
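As an illustration of the decision rule inside such a classifier, here is a minimal Perl sketch. The priors, conditional probabilities, features, and the flat 0.01 fallback are hypothetical stand-ins; the real nb.pl estimates its probabilities from the converted TRAIN data and applies proper smoothing:

# nb_sketch.pl - minimal illustration of the Naive Bayesian decision
# rule (toy probabilities, not the submitted nb.pl).
use strict;
use warnings;

# P(sense) and P(feature|sense), here invented for the example.
my %prior = (cord => 0.5, text => 0.5);
my %cond  = (
    cord => { cut => 0.6, fishing => 0.3, page => 0.1 },
    text => { cut => 0.1, fishing => 0.1, page => 0.8 },
);

# Score each sense as P(sense) * product of P(feature|sense) over the
# features observed in the test instance, and pick the argmax.
sub classify {
    my @features = @_;
    my ($best, $best_score) = (undef, 0);
    for my $sense (keys %prior) {
        my $score = $prior{$sense};
        $score *= ($cond{$sense}{$_} // 0.01) for @features;  # 0.01 stands in for smoothing
        ($best, $best_score) = ($sense, $score) if $score > $best_score;
    }
    return $best;
}

print classify(qw(cut fishing)), "\n";   # prints "cord" (0.09 vs 0.005)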
Observations of Experiments
---------------------------
The following table shows the results we got from running our experiments over the various combinations that had been given to us.

--------------------------------------------
| window size | frequency cutoff | accuracy |
--------------------------------------------
|      0      |        1         |  0.5442  |
|      0      |        2         |  0.5442  |
|      0      |        5         |  0.5442  |
--------------------------------------------
|      2      |        1         |  0.7485  |
|      2      |        2         |  0.7683  |
|      2      |        5         |  0.7347  |
--------------------------------------------
|     10      |        1         |  0.7790  |
|     10      |        2         |  0.8125  |
|     10      |        5         |  0.7930  |
--------------------------------------------
|     25      |        1         |  0.7852  |
|     25      |        2         |  0.7992  |
|     25      |        5         |  0.7950  |
--------------------------------------------

We notice that, in general, as we increase the window size, the accuracy of our Naive Bayesian classifier increases. Intuitively this should be expected: the more feature words we incorporate, the greater the chance that they occur in the test data, and the more samples there are, the higher the probability of assigning the correct sense.

We also notice that as we increase the frequency cutoff, there is a definite decrease in accuracy. Again, this should be expected intuitively. As we keep increasing the frequency cutoff, if (as in our case) stop words have not been eliminated, they have a higher chance of being the only features that survive, since stop words like 'a', 'and', 'the' appear far more frequently than, say, 'instrument', which would be a helpful hint that the sense of line is 'phone'. More interesting but somewhat less frequent words get cut off.

The problem noticed in the previous assignment, and subsequently fixed, was that select.pl was picking up all the data from the files. This was fixed, and after minor changes to both nb.pl and convert.pl the tests were run again, giving the above results. These results fall within the acceptable range, and hence we conclude that the classifier is working properly.

The next step was to find interesting words in the Open Mind data and use our classifier on them. I checked for interesting words manually, basing my acceptability criteria on the three things mentioned in the assignment:

1. Has a reasonable rate of agreement between two taggers.
2. Has a balanced distribution of senses.
3. Has a reasonable number of examples.

I checked that the number of examples was at least 100. I also checked that the tagged senses had a balanced distribution; if one sense dominated the distribution, that word was omitted. Finally, I looked at the rate of agreement between the taggers. Here there was a complicated problem, because the taggers did not always agree. There were also instances that had been tagged only once; in these cases the instance was taken as is, with the given sense. If an instance had been tagged more than once and there was a disagreement, the sense with the maximum number of tags was chosen. If the senses had been tagged an equal number of times, one of them was picked at random and assigned to that instance.

I wrote a program, rawtofinal.pl, to do all of this. It takes as a command line argument the specific word we want the data for, processes the data given to us by Open Mind, and converts it into the line data format. (A small sketch of the tag-resolution step appears below, before the usage summary.)
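Here is a minimal Perl sketch of the tag-resolution scheme just described; the instance ids and data layout are hypothetical, and this is not the submitted rawtofinal.pl:

# tag_resolution_sketch.pl - minimal illustration of the disagreement
# handling described above (toy data, not rawtofinal.pl).
use strict;
use warnings;

# instance id => list of sense tags assigned by the taggers
my %tags = (
    'w1:001' => ['sense1'],                       # single tag: taken as is
    'w1:002' => ['sense1', 'sense2', 'sense1'],   # majority: sense1 wins
    'w1:003' => ['sense1', 'sense2'],             # tie: effectively random pick
);

for my $id (sort keys %tags) {
    my %votes;
    $votes{$_}++ for @{ $tags{$id} };
    # Sort senses by vote count; tied senses end up in an arbitrary
    # relative order, since Perl hash key order is unspecified.
    my @ranked = sort { $votes{$b} <=> $votes{$a} } keys %votes;
    print "$id -> $ranked[0]\n";
}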
Usage: rawtofinal.pl [word]

After doing this, we get the data for that word separated into different files according to sense. The name of each file is the name of the sense contained in that file (like the line data). We can simply run our Naive Bayesian Classifier on this data and get our values.

The words chosen from the given set of Open Mind data were:

1. Aspect
2. Attempt
3. Author
4. Demand
5. Edge
6. Phase

The test runs made on this data gave the following results:

ASPECT
------
Senses considered: aspect10701 aspect10702 aspect10900 aspect10901 aspect12400

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.3627
   0         2        0.3627
   0         5        0.3627
   2         1        0.2950
   2         2        0.3442
   2         5        0.3442
  10         1        0.3114
  10         2        0.3114
  10         5        0.3770
  25         1        0.3770
  25         2        0.3971
  25         5        0.3971
----------------------------------------

ATTEMPT
-------
Senses considered: attempt10400 attempt10402

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.7529
   0         2        0.7529
   0         5        0.7529
   2         1        0.7741
   2         2        0.8064
   2         5        0.8387
  10         1        0.8709
  10         2        0.8709
  10         5        0.8709
  25         1        0.8805
  25         2        0.8387
  25         5        0.8387
----------------------------------------

AUTHOR
------
Senses considered: author11800 author11801

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.5806
   0         2        0.5806
   0         5        0.5806
   2         1        0.7096
   2         2        0.6501
   2         5        0.6129
  10         1        0.6250
  10         2        0.6250
  10         5        0.6250
  25         1        0.6501
  25         2        0.6501
  25         5        0.6501
----------------------------------------

DEMAND
------
Senses considered: demand10400 demand10900 demand11000 demand12200 demand12600

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.5909
   0         2        0.5909
   0         5        0.5909
   2         1        0.5227
   2         2        0.5227
   2         5        0.5227
  10         1        0.5909
  10         2        0.5454
  10         5        0.5000
  25         1        0.6202
  25         2        0.6202
  25         5        0.6202
----------------------------------------

EDGE
----
Senses considered: edge10600 edge10601 edge10700 edge10701 edge11500 edge12500

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.25
   0         2        0.25
   0         5        0.25
   2         1        0.325
   2         2        0.3
   2         5        0.25
  10         1        0.3
  10         2        0.3
  10         5        0.25
  25         1        0.35
  25         2        0.35
  25         5        0.35
----------------------------------------

PHASE
-----
Senses considered: phase10700 phase12600 phase12800 phase12801

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.6060
   0         2        0.6060
   0         5        0.6060
   2         1        0.4545
   2         2        0.3939
   2         5        0.4848
  10         1        0.6060
  10         2        0.5454
  10         5        0.5151
  25         1        0.6060
  25         2        0.6060
  25         5        0.6363
----------------------------------------

We can observe from this data that the accuracy we get is mostly in the very low ranges. This can be for a number of reasons. Some of the reasons I can think of are as follows:

1. The number of instances in the TEST and TRAIN data is very small. In the line data we have over 4000 instances, but here each word has at most around 200 instances. The difference is evident.

2. There are instances that have been tagged only once, or on which the users disagree, and we have to account for all of them. The quality of the tagging is therefore not as good as it could be, which also brings down the accuracy level.

Conclusion
----------
I would like to conclude by saying that the Naive Bayesian classifier implemented by me gives pretty decent results for the line data but gives variable results for the Open Mind data.

References:
-----------
Manning, C.D. & Schutze, Hinrich. 2000. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts.
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Rashmi Kankaria
Assignment no: 3 and 4
Due Date: 11th Nov 2002

Objective: To explore supervised approaches to word sense disambiguation, and to continue with the same for Open Mind data.

Introduction: Supervised disambiguation uses a set of training data that has already been generated for an ambiguous word by sense-tagging its occurrences; this labeled data is then used to disambiguate the word in the next instance where it occurs. This experiment is an attempt to implement a Naive Bayesian Classifier.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        product (with probability 0.5409)
     0             2        product (with probability 0.5409)
     0             5        product (with probability 0.5409)
----------------------------------------------------------------------
     2             1        0.6498
     2             2        0.6366
     2             5        0.6307
----------------------------------------------------------------------
    10             1        0.7598
    10             2        0.7566
    10             5        0.7133
----------------------------------------------------------------------
    25             1        0.8015
    25             2        0.8155
    25             5        0.7931
----------------------------------------------------------------------

In the case where the window size is 0, the sense with maximum probability, here 'product', is assigned as the default sense. This is empirically shown to be correct.

Q> What effect do you observe in overall accuracy as the window size and frequency cutoffs change?

For a given 70-30 split of the training and test data, we can observe that the accuracy keeps increasing as we increase the window size. That is reasonable: as we increase the window size, and so the context, more features are present to disambiguate the word sense, so there is a greater chance of observing a feature frequently with a given sense, and hence a higher probability of guessing the sense of the word correctly.

As we look at the table, we can draw certain conclusions about the pattern of accuracy for a given window size and frequency cutoff. For a constant frequency cutoff, the accuracy increases considerably with window size; e.g., at cutoff 1 the accuracy increases from 0.6498 to 0.8015 as we increase the window size from 2 to 25. Thus it is tempting to conclude that accuracy will always increase with window size.

For a constant window size, however, the accuracy does not change with frequency cutoff as uniformly as we might have expected across all window sizes, and I think that can be argued. With a high frequency cutoff, for a given window size, only the most frequent feature words are taken into consideration, and these are most of the time stop words. The stop words do not give any significant information about the context that is peculiar to the target word and that would help disambiguate it. On the other hand, if the window size is too large, some extraneous information (noise) can get added as features, which may not be so helpful.

The most significant combinations here are window size = 10 with frequency cutoff = 1, and window size = 25 with frequency cutoff = 5.
If you observe these, the accuracy of the first combination is higher than that of the second, and this can help us decide the optimal frequency cutoff and window size. The major flaw of the Naive Bayesian Classifier is that it considers all the features to be independent; this also affects the calculation of the feature probabilities.

Q> Are there any combinations of window size and frequency that appear to be optimal with respect to the others? Why?

As argued above, there are a few combinations of window size and frequency cutoff that appear optimal with respect to the others. As we can observe, for a given window size, the accuracy decreases as the cutoff increases. We also observe that there is no significant change in accuracy as we change the window size from 10 to 25, so the optimal window size for this case can be 10, since simply increasing the window size does not change the accuracy significantly, for 2 reasons:

1. The context might still be smaller than the window size; in this case, increasing the window size does not make any difference.
2. The proximate context matters most for disambiguating the sense, so having a larger window might not help much.

As far as the frequency cutoff is concerned, the most optimal value will be within 2-5: the highest-frequency features are most of the time stop words, which are of no help in disambiguation, while a cutoff as low as 1 keeps all the feature words, most of which are not relevant or occur very infrequently with the word we are disambiguating. Over a large training set, a word within the given range will show strong association with the word we need to disambiguate. This gives us the optimal values of window size and frequency cutoff.

Assignment 4:

For this assignment, as per the requirements for good words, I chose the following words and found the accuracy for different combinations of frequency cutoff and window size.

The file used to convert the data from Open Mind format to line data format is open_to_line.pl.

How to run: perl open_to_line.pl aspect.n

where aspect.n is the file used to convert the aspect data into line data format.

The words I selected are:
1. Unit
2. Aspect
3. Behavior
4. Circumstance
5. Bar
6. Detail

1. Unit: There are seven senses associated with unit.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        unit%1:14:00: (with probability 0.4095)
     0             2        unit%1:14:00: (with probability 0.4095)
     0             5        unit%1:14:00: (with probability 0.4095)
----------------------------------------------------------------------
     2             1        0.3778
     2             2        0.3556
     2             5        0.4667
----------------------------------------------------------------------
    10             1        0.4222
    10             2        0.4222
    10             5        0.3111
----------------------------------------------------------------------
    25             1        0.4444
    25             2        0.3778
    25             5        0.3111
----------------------------------------------------------------------

2. Aspect: There are five senses associated with this word.
Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        aspect%1:09:00: (with probability 0.4183)
     0             2        aspect%1:09:00: (with probability 0.4183)
     0             5        aspect%1:09:00: (with probability 0.4183)
----------------------------------------------------------------------
     2             1        0.4048
     2             2        0.2857
     2             5        0.2619
----------------------------------------------------------------------
    10             1        0.3095
    10             2        0.3810
    10             5        0.3810
----------------------------------------------------------------------
    25             1        0.4048
    25             2        0.4058
    25             5        0.2857
----------------------------------------------------------------------

3. Behavior: There are 4 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        behavior%1:04:00 (with probability 0.5212)
     0             2        behavior%1:04:00 (with probability 0.5212)
     0             5        behavior%1:04:00 (with probability 0.5212)
----------------------------------------------------------------------
     2             1        0.5366
     2             2        0.4390
     2             5        0.4390
----------------------------------------------------------------------
    10             1        0.4146
    10             2        0.3902
    10             5        0.2683
----------------------------------------------------------------------
    25             1        0.5610
    25             2        0.3415
    25             5        0.4146
----------------------------------------------------------------------

4. Circumstance: It has 4 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        circumstance%1:26:01 (with probability 0.6382)
     0             2        circumstance%1:26:01 (with probability 0.6382)
     0             5        circumstance%1:26:01 (with probability 0.6382)
----------------------------------------------------------------------
     2             1        0.5610
     2             2        0.4878
     2             5        0.5366
----------------------------------------------------------------------
    10             1        0.5366
    10             2        0.4390
    10             5        0.4634
----------------------------------------------------------------------
    25             1        0.6098
    25             2        0.5854
    25             5        0.5610
----------------------------------------------------------------------

5. Bar: It has 5 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        bar%1:06:04 (with probability 0.7678)
     0             2        bar%1:06:04 (with probability 0.7678)
     0             5        bar%1:06:04 (with probability 0.7678)
----------------------------------------------------------------------
     2             1        0.5656
     2             2        0.5341
     2             5        0.5656
----------------------------------------------------------------------
    10             1        0.6138
    10             2        0.5945
    10             5        0.6012
----------------------------------------------------------------------
    25             1        0.6821
    25             2        0.6378
    25             5        0.6076
----------------------------------------------------------------------

6. Detail: It has 5 senses.
Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        detail%1:10:00 (with probability 0.5102)
     0             2        detail%1:10:00 (with probability 0.5102)
     0             5        detail%1:10:00 (with probability 0.5102)
----------------------------------------------------------------------
     2             1        0.4221
     2             2        0.4178
     2             5        0.4095
----------------------------------------------------------------------
    10             1        0.4574
    10             2        0.4421
    10             5        0.3980
----------------------------------------------------------------------
    25             1        0.5276
    25             2        0.5132
    25             5        0.4345
----------------------------------------------------------------------

All the words I chose had more than three senses associated with them; the distribution was thus over a wide range of senses, giving more scope for an even distribution. That said, "most" of the words still had one sense much stronger than the others: generally speaking, one sense occurred with probability more than 0.5, while the rest of the senses together made up the remainder of the total probability.

The decision to select a particular word was an interesting part of the assignment, while the conversion of Open Mind data into line data was slightly complicated because of the incompatibility between the two formats. All the tagged words were analysed properly for their number of instances, the number of senses associated with them, the distribution of sense probabilities, and the tagging quality. These factors gave various alternatives for choosing the words.

Since the domain for selecting words was the set tagged by the UMN group, there were many interesting words, like act, cell, etc. However, the tagging quality was found to be not very good, as there was a lot of disagreement amongst the taggers; there was also an option of an "unclear/unlisted" meaning, which limited the choice further. The above words were chosen particularly because all of them have many shades of sense and many meanings in different contexts; they were tough to guess when tagged by the taggers.

As you can see, the accuracy for all the chosen words is on average 40-55%, which hints at several hitches:

a. There was little data for a given sense of a given word.
b. There were many instances where taggers disagreed, so rather than asking more taggers to tag until the right sense could be decided, such instances were assigned randomly. The probability of this happening was quite high, because most instances were tagged only twice or thrice.
c. Many instances had an unclear/unlisted sense and needed to be ignored, which reduced the data further.

In general, the Naive Bayesian Classifier works consistently well for any window size and frequency cutoff; however, more data would have refined the accuracy.

References:
1. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schutze, pp. 235-239.
2. Programming Perl (3rd edition) by Larry Wall, Tom Christiansen and Jon Orwant.
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

NAME: SUMALATHA KUTHADI
CLASS: NATURAL LANGUAGE PROCESSING
DATE: 11/11/02
CS8761: ASSIGNMENT 4

-> OBJECTIVE: TO EXPLORE SUPERVISED APPROACHES TO WORD SENSE DISAMBIGUATION AND TO APPLY A NAIVE BAYESIAN CLASSIFIER TO GOOD WORDS OBTAINED FROM OPEN MIND TAGGING DATA.

-> INTRODUCTION:
-> WORD SENSE DISAMBIGUATION: ASSIGN A MEANING TO A WORD IN CONTEXT FROM SOME SET OF PREDEFINED MEANINGS (OFTEN TAKEN FROM A DICTIONARY).
-> SENSE TAGGING: ASSIGNING MEANINGS TO WORDS.
-> FROM A SENSE TAGGED TEXT WE CAN GET THE CONTEXT IN WHICH A PARTICULAR MEANING OF A WORD IS FOUND.
-> CONTEXT FOR HUMAN: TEXT + BRAIN
-> CONTEXT FOR MACHINE: TEXT + DICTIONARY/DATABASE

-> MAIN PARTS OF ASSIGNMENT:
-> TO RANDOMLY SELECT A% OF THE INPUT TEXT AND PLACE IT IN THE TRAIN FILE. THE REMAINING TEXT IS PLACED IN THE TEST FILE.
-> TO SELECT FEATURES FROM THE INPUT FILE (TRAIN FILE) WHICH SATISFY A FREQUENCY CUTOFF.
-> TO CREATE A FEATURE VECTOR FOR EACH OF THE INSTANCES PRESENT IN BOTH THE TRAIN AND TEST FILES.
-> TO LEARN A NAIVE BAYESIAN CLASSIFIER FROM THE OUTPUT OF THE THIRD PART OF THE ASSIGNMENT AND TO USE THAT CLASSIFIER TO ASSIGN SENSE TAGS TO THE TEST FILE.

-> WHEN CREATING SENSE TAGGED TEXT YOU ARE BUILDING UP A COLLECTION OF CONTEXTS IN WHICH MEANINGS OF A WORD OCCUR. THESE CAN BE USED AS TRAINING EXAMPLES.
-> THE BASIC PRINCIPLE INVOLVED IN WORD SENSE DISAMBIGUATION IS TO SELECT THE SENSE THAT MAXIMISES THE PROBABILITY OF THAT SENSE OCCURRING IN THE GIVEN CONTEXT (MOST LIKELY SENSE).
-> WHILE USING THE "NAIVE BAYESIAN CLASSIFIER" WE ASSUME THAT THE FEATURES ARE CONDITIONALLY INDEPENDENT; THEY DEPEND ONLY ON THE SENSE.
-> NAIVE BAYESIAN CLASSIFIER: S = ARGMAX OVER SENSES OF P(SENSE) * PRODUCT(i = 1 TO N) OF P(C_i | SENSE), WHERE S IS THE CHOSEN SENSE AND C_1..C_N ARE THE CONTEXT FEATURES.

-> REPORT:
-> WE RUN THE PROGRAMS WITH 12 COMBINATIONS OF WINDOW SIZE AND FREQUENCY CUTOFF USING A 70-30 TRAINING-TEST DATA RATIO.

WINDOW SIZE   FREQUENCY CUTOFF   ACCURACY
     0               1            0.5311
     2               1            0.6991
    10               1            0.8354
    25               1            0.8409
     0               2            0.5311
     2               2            0.6970
    10               2            0.8547
    25               2            0.8962
     0               5            0.5311
     2               5            0.6639
    10               5            0.8260
    25               5            0.8755

-> OBSERVATIONS:
-> WHEN THE WINDOW SIZE IS ZERO, NO MATTER WHAT THE FREQUENCY CUTOFF IS, THE ACCURACY IS ALMOST EQUAL.
-> WHEN THE FREQUENCY IS KEPT CONSTANT AND THE WINDOW SIZE IS INCREASED, THE ACCURACY INCREASES.
-> WHEN BOTH THE WINDOW SIZE AND THE FREQUENCY ARE INCREASED, THE ACCURACY INCREASES.
-> THERE IS A RELATION BETWEEN FREQUENCY, WINDOW SIZE AND OVERALL ACCURACY, BECAUSE THE MEANING OF A WORD CAN BE GUESSED FROM ITS SURROUNDING WORDS.

-> GOOD WORDS:
-> GOOD WORDS IN THE OPEN MIND DATA ARE COLLECTED BY CONSIDERING THE DISTRIBUTION OF SENSES, THE AGREEMENT AMONG THE TAGGERS AND THE NUMBER OF EXAMPLES.
-> SELECTING GOOD WORDS INVOLVED 3 MODULES.

1. tag.pl: THIS MODULE FINDS THE DISTRIBUTION OF SENSES, THE AGREEMENT BETWEEN TAGGERS AND THE NUMBER OF EXAMPLES FOR ALL THE WORDS.
COMMAND LINE: perl tag.pl
ACCORDING TO THE OUTPUT OF tag.pl, I GOT "author, memory, art" AS GOOD WORDS.
***author*****
Distribution of author%1:18:00:: : 72
Distribution of author%1:18:01:: : 136
avgagreement = 0.904761904761905
examples: 105

***memory*****
Distribution of memory%1:09:00:: : 59
Distribution of memory%1:09:01:: : 81
Distribution of memory%1:09:02:: : 104
Distribution of memory%1:09:03:: : 25
Distribution of memory%1:06:00:: : 16
avgagreement = 0.674496644295302
examples: 149

***art*****
Distribution of art%1:04:00:: : 35
Distribution of art%1:06:00:: : 31
Distribution of art%1:09:00:: : 25
Distribution of art%1:10:00:: : 19
avgagreement = 1
examples: 110

THESE THREE WORDS HAVE A BETTER DISTRIBUTION THAN THE OTHER WORDS, A GOOD RATE OF AGREEMENT AMONG THE TAGGERS AND A GOOD NUMBER OF EXAMPLES.

2. text.pl: THIS MODULE PLACES THE INSTANCES OF A WORD IN THE APPROPRIATE SENSE FILE OF THE WORD.
COMMAND LINE: perl text.pl ids-to-sentences cs8761-umd.full.detailed goodword

3. THE NAIVE BAYESIAN CLASSIFIER IS USED WITH THESE SENSE FILES TO FIND THE APPROPRIATE SENSE FOR A WORD IN A GIVEN CONTEXT.

WINDOWSIZE   FREQUENCY   ACCURACY
author:
     0           1        0.8888
    10           1        0.7777
    25           2        0.7777
    25           5        1.0000
art:
     0           1        0.3333
    10           1        0.3030
    25           2        0.3030
    25           5        0.3636
memory:
     0           1        0.3478
    10           1        0.3160
    25           2        0.3160
    25           5        0.2666

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

# ********************************************************************************
# experiments.txt             Report for Assignment #4 Open Mind Tagging
# Name: Yanhua Li
# Class: CS 8761
# Assignment #4: Nov. 11, 2002
# ********************************************************************************

This assignment is to apply a Naive Bayesian Classifier to perform word sense disambiguation for Open Mind data. I carried out all experiments with 3 words -- "arm", "circumstance", "manner".

First we need to use sentence1.pl, sentence2.pl, and sentence3.pl to convert the instances in the word and sense files into files grouped by sense. The commands are:

sentence1.pl sense word1
sentence1.pl sense word2
sentence1.pl sense word3

After executing the commands, 7 files were created for "arm": arm1, arm2, arm3, arm4, arm5, unclear, and unlisted. For "circumstance", 6 files were created: circumstance1, circumstance2, circumstance3, circumstance4, unclear, and unlisted. For "manner", 5 files were created: manner1, manner2, manner3, unclear, and unlisted. We use these file names as the actual senses of the instances they contain. We also need to change a little bit of code in nb.pl to supply an array of senses; a change to match the actual sense in nb.pl is also needed. (A small sketch of this kind of per-sense split follows.)
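For illustration, here is a minimal Perl sketch of this kind of per-sense split. The "id sense sentence" input layout is an assumption made for the sketch, not the actual format consumed by sentence1.pl:

# sense_split_sketch.pl - minimal illustration of dividing instances into
# one file per sense (hypothetical input layout, not sentence1.pl).
use strict;
use warnings;

my %handles;
# Input on STDIN: "instance_id sense sentence..." per line.
while (my $line = <STDIN>) {
    chomp $line;
    my ($id, $sense, $sentence) = split /\s+/, $line, 3;
    next unless defined $sentence;
    # Open (and cache) one output file per sense, e.g. "arm1", "unclear".
    unless ($handles{$sense}) {
        open my $fh, '>', $sense or die "cannot open $sense: $!";
        $handles{$sense} = $fh;
    }
    print { $handles{$sense} } "$id: $sentence\n";
}
close $_ for values %handles;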
Resulting Table for "arm"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.717948717948718
     0             2             0.717948717948718
     0             5             0.717948717948718
     2             1             0.717948717948718
     2             2             0.717948717948718
     2             5             0.717948717948718
    10             1             0.717948717948718
    10             2             0.717948717948718
    10             5             0.717948717948718
    25             1             0.717948717948718
    25             2             0.717948717948718
    25             5             0.717948717948718

Resulting Table for "circumstance"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.525
     0             2             0.525
     0             5             0.525
     2             1             0.525
     2             2             0.525
     2             5             0.525
    10             1             0.525
    10             2             0.525
    10             5             0.525
    25             1             0.525
    25             2             0.525
    25             5             0.525

Resulting Table for "manner"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.583333333333333
     0             2             0.583333333333333
     0             5             0.583333333333333
     2             1             0.583333333333333
     2             2             0.583333333333333
     2             5             0.583333333333333
    10             1             0.583333333333333
    10             2             0.583333333333333
    10             5             0.583333333333333
    25             1             0.583333333333333
    25             2             0.583333333333333
    25             5             0.583333333333333

Because of a bug in my classifier, I always get the most common sense as the assigned sense. So the accuracy is always the baseline value, no matter what the window size and frequency cutoff are.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Aniruddha Mahajan
experiments.txt
Assignment 4
CS8761, Fall 2002

The objective of this assignment was to run Open Mind tagged data through the programs written for assignment 3. The first step was to convert the Open Mind data into the format of the line data, as the previous code was written specifically for the line data. I wrote a program, "pre.pl", to perform this task. pre.pl analyzes the Open Mind data and generates as many files for a tagged word as there are corresponding distinct senses. Once this was done, we had to run the data through the previous programs. The Open Mind data did not produce results as good as the line data. The table below gives the accuracy obtained through nb.pl for each combination of window size (W) and frequency cutoff (F).
 W  | F | Captain | Dust   | Share  | Shape  | Length | Energy
----|---|---------|--------|--------|--------|--------|--------
 0  | 1 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 0  | 2 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 0  | 5 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 2  | 1 | 0.5439  | 0.5556 | 0.6866 | 0.3944 | 0.2931 | 0.6098
 2  | 2 | 0.4839  | 0.5556 | 0.7015 | 0.3944 | 0.3621 | 0.6341
 2  | 5 | 0.4731  | 0.4447 | 0.7164 | 0.4085 | 0.3103 | 0.6585
 10 | 1 | 0.5914  | 0.6222 | 0.6865 | 0.4507 | 0.2586 | 0.5610
 10 | 2 | 0.5806  | 0.5556 | 0.6120 | 0.4225 | 0.3276 | 0.5366
 10 | 5 | 0.5377  | 0.5333 | 0.6716 | 0.4225 | 0.3276 | 0.5610
 25 | 1 | 0.5914  | 0.6222 | 0.6269 | 0.3944 | 0.3276 | 0.6098
 25 | 2 | 0.6022  | 0.4889 | 0.6220 | 0.3803 | 0.3276 | 0.4878
 25 | 5 | 0.5591  | 0.4889 | 0.6567 | 0.3944 | 0.3276 | 0.5366

(*) Selection of 'good' words
------------------------------
I selected these words particularly because there is a moderate level of agreement among the taggings done by different persons. Also, the senses have an evenly spread distribution of weight. Finally, all of these words have more than 100 examples.

We can observe in particular that the words with a wider distribution among the senses (captain, shape, length) show behavior similar to the line data. Recall that for the line data the accuracy values increase as the window size increases, but decrease within a window as the frequency cutoff increases.

(*) pre.pl
-----------
I wrote pre.pl to format Open Mind data into line form. pre.pl accepts 2 files as input: the sentences from Open Mind, and the instance ids with the corresponding senses in another file. pre.pl produces output in the form of n files, where n is the number of real senses of the tagged word. Open Mind data also has 'unclear' or 'none of the above' as options; these instances are put in "word." while the other files may be "word.0703", "word.0700", "word.0701", etc.
perl pre.pl length1 length2

where --
length1 is the file containing the tagged Open Mind sentences
length2 is the file containing the instance ids & corresponding senses

+++++++++++++++++++++++++++++++++++++++++++++++++

Typed and Submitted by: Ashutosh Nagle
Assignment 4: A Continuing Quest for Meanings in Li(f|n)e.
Course: CS8761
Course Name: NLP
Date: Nov 11, 2002.

+-----------------------------------------------------------+
|                       Introduction                        |
+-----------------------------------------------------------+

This assignment has the following three parts:

1) Data Creation
~~~~~~~~~~~~~~~~
Here I tagged 564 words provided by the Open Mind Word Expert project.

2) Naive Bayesian Classifier Improvement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The necessary changes have been made to the classifier. Tabulated below are the observations of the experiments that I performed on the "line" data.

+===============================================+
|       W       |       F       |     Accu      |
|===============|===============|===============|
|       0       |       1       |    0.5297     |
|---------------|---------------|---------------|
|       0       |       2       |    0.5297     |
|---------------|---------------|---------------|
|       0       |       5       |    0.5297     |
|===============================================|
|       2       |       1       |    0.7452     |
|---------------|---------------|---------------|
|       2       |       2       |    0.7347     |
|---------------|---------------|---------------|
|       2       |       5       |    0.7146     |
|===============================================|
|      10       |       1       |    0.8103     |
|---------------|---------------|---------------|
|      10       |       2       |    0.8039     |
|---------------|---------------|---------------|
|      10       |       5       |    0.7733     |
|===============================================|
|      25       |       1       |    0.8135     |
|---------------|---------------|---------------|
|      25       |       2       |    0.8111     |
|---------------|---------------|---------------|
|      25       |       5       |    0.7918     |
+===============================================+

For the window size of 0 there are no features, no matter what the frequency cutoff is, so we get the same accuracy for all three frequency cutoffs. The accuracy is also fairly low because the classifier has hardly any data to learn from, so it always assigns the most frequent sense to every occurrence of the ambiguous word.

For the window size of 2, the accuracy jumps into the early 70's (70%), as expected. As the frequency cutoff increases, the number of available features decreases, so the classifier learns from less and less data, and the accuracy falls a little as the cutoff is increased. The same phenomenon is observed for window sizes 10 and 25.

When the window size becomes 10, the accuracy reaches as high as 80%, and it remains (almost) in the 80's even when the window size is raised to 25. This makes sense because as we increase the window size we start considering the global context along with the local context; with this increase we never discard the previous data, so the classifier gets additional data to learn from.

The output files of the above experiments are available under /home/cs/nagl0033/NLP/assignment4/. There are 12 subdirectories there, whose names are of the form "ij", meaning that subdirectory "ij" contains the files of the experiments performed with window size 'i' and frequency cutoff 'j'.
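For reference, a driver like the following could produce those 12 subdirectories (a sketch only; it assumes the feat.pl/convert.pl/nb.pl command-line conventions shown later in this collection, which may differ from the conventions actually used here):

    #!/usr/bin/perl -w
    # sketch: run all 12 window/cutoff experiments, one "ij" directory each
    use strict;

    foreach my $w (0, 2, 10, 25) {
        foreach my $f (1, 2, 5) {
            my $dir = "$w$f";                   # e.g. "101" for W=10, F=1
            mkdir($dir, 0755) unless -d $dir;
            system("perl feat.pl TRAIN $w $f > $dir/FEAT") == 0
                or die "feat.pl failed";
            system("cat $dir/FEAT | perl convert.pl $w TRAIN > $dir/TRAIN.FV") == 0
                or die "convert.pl failed on TRAIN";
            system("cat $dir/FEAT | perl convert.pl $w TEST > $dir/TEST.FV") == 0
                or die "convert.pl failed on TEST";
            system("perl nb.pl $dir/TRAIN.FV $dir/TEST.FV > $dir/TAGGING") == 0
                or die "nb.pl failed";
        }
    }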
3) Applying The NB Classifier to the Open-Mind Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Choosing the "Good Words" :-->
My program "getGoodWords" selects good words for me. It applies the following criteria while doing the selection:

* The word should occur at least 75 times. (With 2 taggers, 150 times.)
* The agreement between the 2 taggers should be at least 40%.
* The word should not have fewer than 2 senses.
* Finally, the word should have evenly distributed tags.

I arrived at the above values by trial and error. They look reasonable and also yield some 10 words from the big pool. From these 10, my program selects the top 6 based on the level of agreement between the taggers. The last criterion above is evaluated by taking the standard deviation of the per-sense counts: if the standard deviation is less than 30% of the mean (average) number of tags per sense, the word is accepted, else it is rejected (a short sketch of this test appears at the end of this section). My program also throws out all the words marked with "unlisted", "unclear" or "empty" tags.

Converting The Data Format :-->
The program "convert" is used to do this.

HOW TO RUN THE PROGRAMS

1. To Retrieve The Good Words :-->
Typing ./getGoodWords executes the program. No parameter is required. It expects the data to be available in "/home/cs2/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed". The program gives a small summary about the good words, shows the six it finally chooses, and the agreement percentage among the taggers for each. It also creates a temporary file "tempFile3" in the directory where it runs, storing the instance ids and the corresponding tags for all of the 6 selected words.

2. To Convert The Data :-->
This is done by the program "convert". Like the previous program, it does not take any input parameter. It reads the "tempFile3" created by the "getGoodWords" program, gets the instance ids of the selected instances, reads the corresponding sentences from the file "/home/cs2/tpederse/CS8761/Open-Mind/ids-to-sentences", and converts them to the line-data format. After it is executed, 6 subdirectories are created in the directory where it was run, each corresponding to one of the good words. Each of these directories contains as many files as the number of senses of the good word it represents.

3. Now the classifier is to be run as usual on this data. But here the names of the files are different, so I have made copies of the 4 files of the classifier (select.pl, feat.pl, convert.pl and nb.pl). Each set allows only the filenames of the corresponding good word.
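The balance test described above can be sketched as follows (an illustration of the stated criterion, not getGoodWords' actual code):

    # accept a word if the standard deviation of its per-sense counts
    # is under 30% of their mean; the word needs at least 2 senses
    sub is_balanced {
        my @counts = @_;                  # per-sense tag counts
        return 0 if @counts < 2;
        my $mean = 0;
        $mean += $_ foreach @counts;
        $mean /= @counts;
        my $var = 0;
        $var += ($_ - $mean) ** 2 foreach @counts;
        $var /= @counts;
        return sqrt($var) < 0.3 * $mean;
    }

For the sense distribution of "art" reported later in this collection (31, 19, 35, 25), for instance, the mean is 27.5 and the standard deviation about 6.1, which is under the 8.25 cutoff, so a test like this would accept the word.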
Tabulated below are the observations of my experiments.

+=====================================================================+
|       |       |                      Accuracy                       |
|   W   |   F   |-----------------------------------------------------|
|       |       |captain|completion|distribution| enemy | past | shape|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|   0   |   2   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|       |   5   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.4103 |  0.1765  |   0.5152   |0.1212 |0.3529|0.4815|
|   2   |   2   |0.3333 |  0.2059  |   0.5758   |0.1515 |0.4412|0.4074|
|       |   5   |0.2821 |  0.2353  |   0.4848   |0.3030 |0.4118|0.3333|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.3077 |  0.6765  |   0.2727   |0.3333 |0.2941|0.2963|
|  10   |   2   |0.2821 |  0.6471  |   0.1818   |0.3333 |0.3529|0.3704|
|       |   5   |0.2821 |  0.5000  |   0.3939   |0.3333 |0.2647|0.1852|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.2821 |  0.7059  |   0.1818   |0.1515 |0.2353|0.1852|
|  25   |   2   |0.2821 |  0.5882  |   0.1818   |0.1818 |0.2353|0.1481|
|       |   5   |0.2821 |  0.5588  |   0.2121   |0.0909 |0.3235|0.1111|
+=====================================================================+

I see from the table that the classifier performs better for the word "completion" than for any of the other words. The strangest part is that, for most words, the accuracy decreases as the window size is increased. The word "completion" has much more data than any other word above; this, I think, is the reason the classifier performs better for "completion" than for any other word. For any word, at a given window size, the accuracy decreases as the frequency cutoff is increased. This, I feel, is reasonable, because with an increasing cutoff the number of features decreases, so the classifier learns from less and less data.

CONCLUSION: For the classifier to work well, the data fed to it for training must be large.

+++++++++++++++++++++++++++++++++++++++++++++++++

CS 8761 NATURAL LANGUAGE PROCESSING FALL 2002
ANUSHRI PARSEKAR
pars0110@d.umn.edu

ASSIGNMENT 3 : HOW TO FIND MEANINGS IN LIFE.

Introduction
------------
Many times we come across words which have several meanings (or senses), so there can be ambiguity about their meaning. Word sense disambiguation deals with assigning meanings or senses to ambiguous words by observing the context, i.e. the neighboring words.
In this assignment we attempt to disambiguate various meanings of the word "line" using the sense-tagged line data, in which six different senses of the word line have been identified and the data is divided into separate files for each sense.

Programs implemented
---------------------
select.pl : This program tags every instance of the target word (here "line") according to the file in which it appears and then randomly divides the data into TRAIN and TEST parts.

feat.pl : This module identifies the feature words associated with the word to be disambiguated. The range of feature words can be controlled by changing the window size as well as the required frequency of occurrence of the feature word. The feature words are obtained from the TRAIN file only.

convert.pl : The input to this program is the TRAIN or TEST file and the list of feature words. Each instance in the file is converted into a series of binary values that indicate whether or not each listed feature word has occurred within the specified window around the target word in the given instance.

nb.pl : This program learns a Naive Bayesian classifier from the feature vectors of TRAIN and uses that classifier to assign sense tags to the occurrences of the target word in the TEST file. The tag assigned is the sense having the maximum P(sense|context). The context words are assumed to be conditionally independent.

Expected results
----------------
1. When the window size is zero we do not have any feature words or context to guess the sense of the target word. The only guideline we have is the probability of each sense, and we assign the sense which has the maximum probability to all the instances of the test data. The accuracy thus obtained will not depend on the frequency cutoff.
2. As we increase the window size, the number of feature words will increase. We expect an increase in accuracy with an increase in window size, because a larger window means that we are considering a wider context. Feature words which were ignored for a small window size will also be used and might give a better accuracy.
3. Increasing the frequency cutoff will again reduce the number of feature words, which may reduce the accuracy.
4. If we run the program on just 50% of the training data, the accuracy should drop, as the classifier will have less data to learn from.
5. If the target word has fewer senses, a better accuracy can be expected, because the sense assigned to the word is chosen from a smaller set of alternatives.

Results
--------
-----------------------------------------------------
| window size | frequency cutoff |   accuracy for   |
|             |                  |  70% TRAIN data  |
-----------------------------------------------------
|      0      |         1        |      0.5568      |
|      0      |         2        |      0.5568      |
|      0      |         5        |      0.5568      |
|-------------|------------------|------------------|
|      2      |         1        |      0.7478      |
|      2      |         2        |      0.7301      |
|      2      |         5        |      0.7139      |
|-------------|------------------|------------------|
|     10      |         1        |      0.8268      |
|     10      |         2        |      0.8276      |
|     10      |         5        |      0.7792      |
|-------------|------------------|------------------|
|     25      |         1        |      0.8426      |
|     25      |         2        |      0.8485      |
|     25      |         5        |      0.8187      |
-----------------------------------------------------

Conclusions
-----------
1. The accuracy increases as we take more feature words. Thus, when we increase the window size or take a low frequency cutoff we are considering a more detailed and wider context, and hence the accuracy increases.
2. However, a very large window size does not help much, because the target word is usually unrelated to the distant words which appear as feature words when the window size is large.
3. The memory and time required for a high window size and low frequency cutoff are quite high compared to the increase in accuracy that they give. This should be considered while choosing an optimal combination of window size and frequency cutoff.
4. A window size of 10 and a frequency cutoff of 1 or 2 seem to give a good accuracy for the line data. However, we cannot make general statements on the basis of this, since the senses of "line" are distinct and have a separate set of feature words for each sense.

ASSIGNMENT 4 : A CONTINUING QUEST FOR MEANINGS IN LIFE

Introduction
-------------
This assignment deals with word sense disambiguation (using the naive Bayesian classifier) of words which were tagged in the Open Mind project. The Open Mind data was divided into two files: one contained the instance ids and the sentences, while the other had the instance ids and the senses assigned to the words by the users. A program was required to split the Open Mind data into separate files according to the sense assigned.

Program implemented
-------------------
A program was implemented to do the above-mentioned task. The input to the program is two files, word_ln (contains the ids and the sentences for a given word) and word_tag (contains the ids and the senses). The output is a number of files sorted according to the senses. E.g. for the word aspect:

perl id.pl aspect_ln aspect_tag

output files: aspect.0 aspect.1 aspect.2 aspect.3 aspect.4

These files can be used as the input to select.pl along with an aspect.config file.

Results
------------------------------------------------------------------------------
| window size | frequency cutoff |   ASPECT    |    PHASE    | DISTRIBUTION  |
|             |                  | senses = 5  | senses = 4  |  senses = 5   |
|----------------------------------------------------------------------------|
|      0      |         1        |  0.500000   |  0.7142857  |  0.5000000    |
|      0      |         2        |  0.500000   |  0.7142857  |  0.5000000    |
|      0      |         5        |  0.500000   |  0.7142857  |  0.5000000    |
|-------------|------------------|-------------|-------------|---------------|
|      2      |         1        |  0.531250   |  0.7142857  |  0.4736842    |
|      2      |         2        |  0.468750   |  0.7142857  |  0.4473684    |
|      2      |         5        |  0.500000   |  0.7142857  |  0.5000000    |
|-------------|------------------|-------------|-------------|---------------|
|     10      |         1        |  0.375000   |  0.6190476  |  0.3947368    |
|     10      |         2        |  0.468750   |  0.7142857  |  0.3157895    |
|     10      |         5        |  0.500000   |  0.7142857  |  0.4210526    |
|-------------|------------------|-------------|-------------|---------------|
|     25      |         1        |  0.250000   |  0.6190476  |  0.5789474    |
|     25      |         2        |  0.187500   |  0.7142857  |  0.4210526    |
|     25      |         5        |  0.375000   |  0.6666667  |  0.4736842    |
------------------------------------------------------------------------------

------------------------------------------------------------------------------
| window size | frequency cutoff |    SHARE    |    DUST     |   CAPTAIN     |
|             |                  | senses = 4  | senses = 3  |  senses = 7   |
|----------------------------------------------------------------------------|
|      0      |         1        |  0.4545455  |  0.1363636  |  0.3695652    |
|      0      |         2        |  0.4545455  |  0.1363636  |  0.3695652    |
|      0      |         5        |  0.4545455  |  0.1363636  |  0.3695652    |
|-------------|------------------|-------------|-------------|---------------|
|      2      |         1        |  0.6666667  |  0.5000000  |  0.4130435    |
|      2      |         2        |  0.4242424  |  0.5454545  |  0.4565217    |
|      2      |         5        |  0.4545455  |  0.1363636  |  0.2608696    |
|-------------|------------------|-------------|-------------|---------------|
|     10      |         1        |  0.6363636  |  0.3181818  |  0.4130435    |
|     10      |         2        |  0.5757576  |  0.3636364  |  0.3260870    |
|     10      |         5        |  0.4545455  |  0.3636364  |  0.3043478    |
|-------------|------------------|-------------|-------------|---------------|
|     25      |         1        |  0.5454545  |  0.2727273  |  0.3695652    |
|     25      |         2        |  0.5151515  |  0.4545455  |  0.3478261    |
|     25      |         5        |  0.4242424  |  0.3181818  |  0.2826087    |
------------------------------------------------------------------------------

Conclusions:
1. The accuracy for all the words chosen from the Open Mind data is much lower than for the line data. This may be because the senses of the chosen words were not clearly distinct but were interrelated, whereas for the line data the meanings were completely different, which made disambiguation easier. Also, the data from the Open Mind project was tagged by a number of users, so the correctness of the tagged data needs to be checked. The number of instances for line was much higher than for the Open Mind data, so more training data was available in the case of the line data.
2. A window size of 2 appears to give better accuracy in the case of the Open Mind data.
3. No clear trends in accuracy with respect to changes in window size and frequency cutoff are observed for the Open Mind data.
4. The naive Bayesian classifier gives poor results for the Open Mind data, since the context for each sense is not distinct and is sometimes difficult to disambiguate even manually.

+++++++++++++++++++++++++++++++++++++++++++++++++

REPORT
--------
-----------------------------------------------------------------------------
Prashant Rathi                                         Date - 11th Nov. 2002
CS8761 - Natural Language Processing
-----------------------------------------------------------------------------
Assignment no. 4
------------------
Objective :: To continue your exploration of supervised approaches to word sense disambiguation
-----------------------------------------------------------------------------
Observations on the open-mind data :
----------------------------------
I considered two files from the Open Mind data: cs8761-umd.full.detailed and ids-to-sentences. The first file contains all the information about the tagging done by different users, and the second file contains the actual sentences corresponding to the instance ids. I wrote a program (splitter.pl) to split the cs8761-umd.full.detailed file into various files, with a single file for each word. Similarly, I separated the ids-to-sentences file into smaller data files corresponding to every word (separator.pl); this reduces file access time, since ids-to-sentences is a very large file that takes time to access.

Now we had to select "good" words, depending upon the following criteria:
a. Has a reasonable rate of agreement among the two taggers.
b. Has a somewhat balanced distribution of senses.
c. Has a reasonable number of examples.

To find such words, I have written a program (getstats.pl). This program outputs all the information about each word; the output can be seen in the RESULTS file attached. (A sketch of this kind of per-word tallying follows.)
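A minimal sketch of such per-word tallying (an illustration only; the field positions below are assumptions, not the actual cs8761-umd.full.detailed layout):

    #!/usr/bin/perl -w
    # sketch: tally example and sense counts per word from tagging records
    use strict;

    my (%examples, %senses);
    while (<>) {
        # assumed layout: word, instance id, sense as the first three fields
        my ($word, $id, $sense) = (split)[0, 1, 2];
        next unless defined $sense;
        $examples{$word}++;
        $senses{$word}{$sense}++;
    }
    foreach my $word (sort keys %examples) {
        print "$word\t$examples{$word}";
        print "\t$_=$senses{$word}{$_}" foreach sort keys %{ $senses{$word} };
        print "\n";
    }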
The RESULTS file helps us to identify the "good" words by studying their different characteristics. By studying this file, I picked a few good words: share, captain, edge, dust, distribution, shape.

Steps to be followed to run for a word, say 'shape':
------------------------------------------------
1. Use 'getstats.pl tagshape > RESULTS' to get information about the word shape. This is optional. (The res file contains the script for this.)
2. Use 'separator.pl shape' to get the data from ids-to-sentences for all 'shape' instances.
3. Use 'splitter.pl' to create the file tagshape from cs8761-umd.full.detailed. We need this tagshape further.
4. Use 'createdata.pl tagshape shape' to get all the sense files corresponding to the shape instances.
5. Then run all the assignment 3 programs on these sense files (we may have to change the target.config file before running them):
   a. select.pl   b. feat.pl   c. convert.pl   d. nb.pl

The observations made for these words follow next.
-------------------------------------------------------------------------------
OBSERVATIONS:
------------
Experiments were carried out with window sizes of 0, 2, 10 and 25 and frequency cutoffs of 1, 2 and 5. For these 12 combinations of frequency and window size, with a 70-30 training-test data ratio, the following observations were made:

share
-----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.4667  |
     0      |         2         |  0.4667  |
     0      |         5         |  0.4667  |
     2      |         1         |  0.5556  |
     2      |         2         |  0.4889  |
     2      |         5         |  0.4667  |
    10      |         1         |  0.4667  |
    10      |         2         |  0.4667  |
    10      |         5         |  0.4222  |
    25      |         1         |  0.4444  |
    25      |         2         |  0.4667  |
    25      |         5         |  0.4444  |
--------------------------------------------

captain
-------
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3148  |
     0      |         2         |  0.3148  |
     0      |         5         |  0.3148  |
     2      |         1         |  0.3889  |
     2      |         2         |  0.3148  |
     2      |         5         |  0.2222  |
    10      |         1         |  0.3889  |
    10      |         2         |  0.3519  |
    10      |         5         |  0.2593  |
    25      |         1         |  0.3704  |
    25      |         2         |  0.3519  |
    25      |         5         |  0.2407  |
--------------------------------------------

edge
----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.2500  |
     0      |         2         |  0.2500  |
     0      |         5         |  0.2500  |
     2      |         1         |  0.3750  |
     2      |         2         |  0.2500  |
     2      |         5         |  0.3000  |
    10      |         1         |  0.3000  |
    10      |         2         |  0.3000  |
    10      |         5         |  0.3250  |
    25      |         1         |  0.2750  |
    25      |         2         |  0.2500  |
    25      |         5         |  0.2000  |
--------------------------------------------

dust
----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3333  |
     0      |         2         |  0.3333  |
     0      |         5         |  0.3333  |
     2      |         1         |  0.4444  |
     2      |         2         |  0.3333  |
     2      |         5         |  0.2222  |
    10      |         1         |  0.4167  |
    10      |         2         |  0.4167  |
    10      |         5         |  0.3333  |
    25      |         1         |  0.3333  |
    25      |         2         |  0.4167  |
    25      |         5         |  0.4167  |
--------------------------------------------

distribution
------------
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.4000  |
     0      |         2         |  0.4000  |
     0      |         5         |  0.4000  |
     2      |         1         |  0.3500  |
     2      |         2         |  0.3500  |
     2      |         5         |  0.3500  |
    10      |         1         |  0.4000  |
    10      |         2         |  0.4000  |
    10      |         5         |  0.4500  |
    25      |         1         |  0.4000  |
    25      |         2         |  0.4500  |
    25      |         5         |  0.3750  |
--------------------------------------------

shape
-----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3077  |
     0      |         2         |  0.3077  |
     0      |         5         |  0.3077  |
     2      |         1         |  0.3590  |
     2      |         2         |  0.3846  |
     2      |         5         |  0.3846  |
    10      |         1         |  0.4077  |
    10      |         2         |  0.3846  |
    10      |         5         |  0.3077  |
    25      |         1         |  0.3590  |
    25      |         2         |  0.4103  |
    25      |         5         |  0.3333  |
--------------------------------------------

- The results observed for the open-mind data show very low accuracy values.
- This may be due to the fact that the data contained limited instances.
- In the cases where there was disagreement between the users' tagging, I picked the senses randomly; this might result in an inappropriate distribution among the senses.
- Also, sometimes only one user has tagged a particular word (e.g. art), so there was total agreement, but such words do not qualify as good words.
- The quality of the tagging may not be satisfactory.
-------------------------------------------------------------------------------
CONCLUSION:
-----------
Considering the reasons mentioned above, I think this is why the results observed for the Open Mind data show low accuracy. The data is not as good as the line data, which had many more instances and tagging of much higher quality.
-------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++

CS 8761: Assignment 4 - Naive Bayesian Classifier Repaired
Due: 11/11/2002 12pm
Sam Storie
11/4/2002

+------------------------------------------------------------------------+
|                                 Methods                                 |
+------------------------------------------------------------------------+

Introduction:

A Bayesian classifier (BC) is a technique used to classify a set of events into "bins" and then use certain information to classify new events into one of those bins. In the context of word sense disambiguation, we use the set of words surrounding a "target word" to determine the sense of that word. These surrounding words are considered "features" of the word and are treated as potential identifying items for the sense of the target word. This assignment deals with determining the sense of the word "line". To do this we examine a set of data where the sense of line has been predetermined and use this data to "train" our classifier. Then we use the information from the training data to try to determine the sense of some unseen testing data. The details of this process are described in subsequent sections.

Process:

This assignment is broken into four separate programs. The first program examines a set of pre-tagged instances for line. To ensure our classifier isn't developing a bias, this program separates the data into training and testing sets: it randomly selects a certain percentage of the instances to be placed into a file called TRAIN, and the remainder are placed into a file called TEST. The classifier is developed using only the training data, and isn't shown the testing data until the actual performance testing is done.

The second stage of the classifier is a program that determines which words occur within a certain window of the target word in the training instances. Recall from the introduction that we are considering these contextual words to be some sort of indicator of the sense for that instance.
Of course there are some overlaps for common words, but this is taken into account during this process. To help identify words that appear often within this window, we can set a frequency cutoff to eliminate uncommon words. The result of this stage is a list of words that occur within a certain window size and that also occur a certain number of times. This stage is performed only on the *training* data, since using the testing data would introduce a bias towards the features present in the testing data, which would obviously nullify any results we ultimately obtained from that data.

The third program uses the list of words from the previous stage to create feature vectors for all the instances of training and testing data. A feature vector is simply a list of 1's and 0's indicating whether a given word occurred within the window used to generate the features. If a given instance had a 1 for the word "IBM", this would mean that "IBM" appeared within the window of "line" for this instance. These feature vectors are generated for both the training and testing data.

The final stage is to actually create the classifier and attempt to assign senses to the testing data. To do this we examine the feature vectors for the training data and determine how often each feature occurred with each sense. From this data we can compute the probability associated with a sense based on which words occur within the window size specified in the earlier stages. Then, when we examine the feature vector for a test instance, we can determine which sense is most likely (based on the training instances used) given the features that are in the window. The probability of each sense is computed by multiplying the probabilities of each feature occurring, given the sense. We then simply select the sense with the largest resulting probability, apply it with the classifier, and check it against the actual sense. The final accuracy (number correct / total number of instances) is reported to the user.

Technically, using the probabilities of the features occurring with the target word is a very involved process, but we are implementing a *naive* BC. This means that we treat the probability of each feature occurring with the target word as conditionally independent. If we did not make this assumption, we would need to consider the probability of the features occurring not just separately, but in the sequence in which they actually appear. Zipf's law suggests that once you start considering sequences of more than 3-4 words, the probabilities become very hard to estimate, simply because those sequences do not occur very often and it is hard to assign how likely they should be. This is a non-trivial distinction, but it impressively still generates some useful results. As a final note, in the case of a feature not occurring within a training instance, which would give a probability of zero, we use Witten-Bell smoothing to give those cases a probability to use in the calculations.

+------------------------------------------------------------------------+
|                  Result from classifier improvements                    |
+------------------------------------------------------------------------+

I ran this set of four programs with a combination of window sizes and frequency cutoffs. Per the assignment, the initial set of instances was split into 70% training data, and 30% were set aside for testing. The initial data was split up only once, and the resulting TRAIN and TEST files were reused for each of these tests (a sketch of such a one-time split follows).
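A minimal sketch of a one-time random split like this (an illustration under assumed conventions, not the author's actual select.pl; the script name split.pl is hypothetical):

    #!/usr/bin/perl -w
    # sketch: randomly place ~70% of tagged instances in TRAIN, rest in TEST
    # usage: perl split.pl 70 < tagged-instances
    use strict;

    my $ratio = shift;                          # training percentage, e.g. 70
    open(TRAIN, ">TRAIN") or die "cannot write TRAIN: $!";
    open(TEST,  ">TEST")  or die "cannot write TEST: $!";
    while (<>) {                                # one sense-tagged instance per line
        print { rand(100) < $ratio ? \*TRAIN : \*TEST } $_;
    }
    close(TRAIN);
    close(TEST);

Because the split is done once and saved, every (window size, cutoff) combination is evaluated on exactly the same TRAIN/TEST partition, so the accuracies in the table below are directly comparable.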
The results are summarized as such:

Window size | Frequency cutoff | Accuracy
+------------------------------------------+
     0              1            .5350   <-- Baseline case (worst)
     0              2            .5350
     0              5            .5350
     0             10            .5350
     2              1            .7502
     2              2            .7213
     2              5            .6827
     2             10            .6217
    10              1            .8129
    10              2            .8008
    10              5            .7637
    10             10            .7454
    25              1            .8457   <-- Best result
    25              2            .8156
    25              5            .7802
    25             10            .7655
+------------------------------------------+

As you can see in the table, the best accuracy obtained was about 85%, and this was consistent across several trials. I found it rather interesting to see how the window size affected the accuracy across the trials. For a window size of 0, I would expect the accuracy to match the frequency of the most common sense: with no features to help in the classification, the only data that can be used is how often each sense occurred, so every test instance is tagged with the same sense. In the case of line, the "product" sense occurs about 54% of the time (2218 out of 4149 instances), and this is close to the results obtained for the window size of 0.

It's much more interesting when the window size is increased, even to a size of 2. Intuitively this might seem not to impact the classifier much, since the most common words within 2 of the target are probably words like "the", "at", "for", etc. However, it's clear that even though these words might occur often next to the target, something else is helping the classifier. I believe collocations, or common bigrams/trigrams, are providing some features that clearly help to classify the test instances. Traditionally these are called "local features", and they provide some much-needed evidence for the classifier. Once the window size is increased to 10 or 25, the classifier is able to include "topical features", those that do not occur immediately next to the target word. In the case of "line", these features included "IBM" for the product sense, or "telephone" for the cord sense.

Also, observe how the results actually decrease as the frequency cutoff is increased. It's easy to imagine that removing common words might help the classification task, but what's really happening is that the cutoff reduces the amount of data the classifier has to use. This helps show how these types of classifiers are fairly immune to "noise" in the data. It's also tempting to just increase the window size and include every word possible, but there is a trade-off here, because the computational time required for large amounts of data can grow quite large. As an example, it took an AMD Athlon 1700+ computer with 512M of RAM about 10 minutes to process the case of W=25 and F=1. This might not seem too bad, but it was substantially longer than the smaller window size cases, and didn't radically improve the results. This is implementation dependent, but the trade-off still exists.

On a final note, the meaning of an 85% success rate is difficult to gauge, because for a human this task is often trivial. However, given that the actual tests are based solely on the instance itself, I think it's rather impressive that such a simple technique can get a correct result 85% of the time. I imagine that some tweaking of the various parameters of the experiment, or even of the specific data being used, might produce slightly better results. Of course, these results depend on the amount of training data available (about 3600 instances in this case) and the breadth of examples that data covers.
Still, with these limitations in mind, the Naive Bayesian Classifier is a simple and clearly useful technique for performing word sense disambiguation.

+-------------------------------------------------------------------------+
|                    References and More Information                      |
+-------------------------------------------------------------------------+

(1) Original assignment on Dr. Pedersen's web page -
    http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a3.html
(2) Revision assignment on Dr. Pedersen's web page -
    http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a4.html

+++++++++++++++++++++++++++++++++++++++++++++++++

***********************************************************************************************
* Refer to README.TXT for listing of various files attached with assignment 4 and their usage *
***********************************************************************************************

                                     \\\|///
                                   \\  ~ ~  //
                                    (  @ @  )
**********************************-oOOo-(_)-oOOo-************************************
            CS 8761 - Natural Language Processing - Dr. Ted Pedersen
          Assignment 4 : A CONTINUING QUEST FOR MEANINGS IN LI(F|N)E
                        Due : Monday , November 11, Noon
                           Author : Anand Takale ( )
***************************************-OoooO-***************************************
                                    .oooO
                                    (   )   (   )
                                     \ (     ) /
                                      \_)   (_/

-------------
Objectives :
-------------
To continue your exploration of supervised approaches to word sense disambiguation.

--------------
Specification:
--------------
(1) Fix your Naive Bayesian Classifier from Assignment 3. If you got less than 10 on this assignment you have a problem that requires fixing. Once you have your classifier, rerun the line data as described for Assignment 3 and rewrite your report reflecting the new results.
(2) Identify three "good" words in the Open Mind data that we created as a class. (This is found in the file /home/cs/tpederse/CS8761/Open-Mind-cs8761-umd.full.details.)
(3) Once you have identified your words, convert the Open Mind data into the form of the line data so you can use your assignment 3 classifier. Run your classifier on that data and comment on your results.

------------------------------------------------------------
Part I -- Fixing Naive Bayesian Classifier from Assignment 3
------------------------------------------------------------
With the Naive Bayesian Classifier from assignment 3, I was getting accuracy in the 0.7 range, which was less than expected. I tried to correct the earlier classifier, i.e. I tried to see if there were any bugs or logical errors in nb2.pl, which was submitted as part of the tar file for Assignment 3. However, this became quite a tedious job, so I recoded the entire nb.pl and have placed it in the tar file of assignment 4 as nb2.pl. The earlier nb.pl is also included in the tar file, if required for reference. After recoding nb.pl as nb2.pl, I realized that there was a mistake in the smoothing part of nb.pl, as well as a minor correction needed in the denominator of the classifier, which did not affect the accuracy but affected the associated probability. These corrections have been made to the classifier (a sketch of one such smoothing scheme follows).
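For reference, one common way to give unseen features a small probability instead of zero is a Witten-Bell style estimate. The sketch below is an illustration of that scheme, not necessarily the exact fix made in nb2.pl:

    # sketch of a Witten-Bell style estimate for P(feature | sense)
    #   $count - times this feature occurred with this sense
    #   $n     - total feature occurrences seen with this sense
    #   $t     - number of distinct features seen with this sense
    #   $z     - number of features never seen with this sense
    sub wb_prob {
        my ($count, $n, $t, $z) = @_;
        return $count > 0
            ? $count / ($n + $t)              # discounted seen probability
            : $t / ($z * ($n + $t));          # leftover mass shared by unseen
    }

The seen probabilities are slightly discounted by adding $t to the denominator, and the t/(n+t) mass freed up this way is divided evenly among the z features never observed with the sense, so the estimates still sum to one.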
Here the same report for Assignment 3 is attached, along with the new values recorded.

**************************************************************************************

                                     \\\|///
                                   \\  ~ ~  //
                                    (  @ @  )
**********************************-oOOo-(_)-oOOo-************************************
            CS 8761 - Natural Language Processing - Dr. Ted Pedersen
                 Assignment 3 : HOW TO FIND MEANINGS IN LIFE
                        Due : Friday , November 1, Noon
                           Author : Anand Takale ( )
***************************************-OoooO-***************************************
                                    .oooO
                                    (   )   (   )
                                     \ (     ) /
                                      \_)   (_/

------------
Objectives :
------------
To explore supervised approaches to word sense disambiguation. To create sense-tagged text and see what can be learned from the same.

---------------
Specification :
---------------
Sense tag some text and implement a Naive Bayesian classifier to perform word sense disambiguation.

----------------------
Part I : Data Creation
----------------------
Tag 500 instances/sentences on the Open Mind Word Expert website.

login id : Anand
Project : CS8761-UMD
Instances Tagged : 515

There was no output to be turned in from the Open Mind sense tagging.

--------------------------------------------------
Part II : Naive Bayesian Classifier Implementation
--------------------------------------------------
This assignment mainly deals with word sense disambiguation. The problem to be solved is that many words have several meanings or senses, so for such words, taken out of context, there is ambiguity about how they are to be interpreted. Our task is to do the disambiguation, i.e. to determine which of its senses an ambiguous word invokes in a particular use. The assignment required us to implement a Naive Bayesian Classifier: in short, it learns a Naive Bayesian classifier from the TRAIN data and uses that classifier to assign sense tags to the TEST data. The idea of the Bayes classifier is that it looks at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does not do feature selection; instead it collects evidence from all features. We performed our experiments with the "line" data.

The assignment required the implementation of four modules:
(1) select.pl  -- program to construct TRAIN and TEST data
(2) feat.pl    -- program to extract features
(3) convert.pl -- program to build the feature vector representation
(4) nb.pl      -- program to learn a Naive Bayesian Classifier from TRAIN.FV and use that classifier to assign sense tags to TEST.FV

Our main task was to create a Naive Bayesian Classifier from the TRAIN data and then disambiguate the words in the TEST data. We had to perform disambiguation of the "line" data.
For this, the six files containing the different senses of line were combined, and lines were read randomly from any of the files and put into the TRAIN and TEST files. The TRAIN file consisted of 70% of the data, while the TEST file consisted of the remaining 30%. After creating the TRAIN and TEST files, the features which had a frequency greater than the cutoff frequency were selected. Observing the extracted features, it was noted that conjunctions, prepositions and determiners like 'a', 'an', 'and' and 'the' occurred in almost all cases. To be more specific about the task of disambiguation, we need to find more dependent collocations and try to eliminate all the 'stop' words (conjunctions, determiners, etc.) so that the accuracy of the classifier improves.

------------
Experiments
------------
Experiment with window sizes of 0, 2, 10 and 25. Use frequency cutoffs of 1, 2 and 5. Run your classifiers with all 12 possible combinations of window size and frequency cutoff using a 70-30 training-test data ratio. Report the accuracy values that you obtain for each combination in a table.

------------------------------------------------------------
window size | frequency cutoff | accuracy
------------------------------------------------------------
     0      |        1         |  0.5397
     0      |        2         |  0.5397
     0      |        5         |  0.5397
------------------------------------------------------------
     2      |        1         |  0.7197
     2      |        2         |  0.7213
     2      |        5         |  0.7012
-----------------------------------------------------------
    10      |        1         |  0.8143
    10      |        2         |  0.8231
    10      |        5         |  0.7907
-----------------------------------------------------------
    25      |        1         |  0.8446
    25      |        2         |  0.8437
    25      |        5         |  0.8283
------------------------------------------------------------

-----------------------------------------------------------------------------------
What effect do you observe in overall accuracy as the window size and the frequency cutoffs change?
----------------------------------------------------------------------------------
After observing the data recorded in the table above, we come to the following conclusions:

(1) When the window size is zero, i.e. there are no features available to train the Naive Bayesian Classifier, the sense that occurs the most in the TRAIN data is assigned as the calculated sense for all the instances of the TEST data. For the line data the accuracy observed for a window size of 0 was 0.5397, which is simply the probability of the most frequent sense. This is the accuracy obtained without any prior knowledge, i.e. without training the Naive Bayesian Classifier at all, and it is the lowest accuracy this classifier will produce: for the line data the minimum accuracy obtained will be 0.5397.

(2) For a window size of 2 the accuracy increases considerably, going up to 0.7197 for a cutoff frequency of 1. What we can conclude from this is that with a window size of 2 we can predict the sense of almost 72% of the test instances, so disambiguation becomes easier when we have some context rather than no context at all. One more observation can be made at this point: according to Zipf's law, most of the bigrams occur rarely while a few occur frequently, and we observe the same effect here. For the window of size 2, as the cutoff frequency increases the accuracy decreases, which is consistent with most of the bigrams occurring very few times.
So as the frequency cutoff goes on increasing, the chance of all the words occurring becomes less; this is what we observe from the table above.

(3) Observing the accuracy values for different window sizes, we see that the accuracy goes on increasing up to a certain point before falling off again; a decrease in accuracy with an increasing frequency cutoff is also observed. In both cases the decrease can be explained as follows. As we increase the window size from 0 to 25, the number of features goes on increasing, and the Naive Bayesian Classifier undergoes a lot of learning. But not all the features are important for training the classifier: the stop words, i.e. the conjunctions and determiners, are of no use. What we are looking for is a more solid feature which is closely connected to the tagged word; by finding such solid features, it becomes much easier to train the classifier. In short, the decrease in accuracy can be considered a consequence of the increase in noise words added to the feature vector, and these noise words are introduced as a result of an increase in window size or in cutoff frequency.

------------------------------------------------------------------------------------
Are there any combinations of window size and frequency that appear to be optimal with respect to the others? Why?
-----------------------------------------------------------------------------------
From the table we observe that the optimum value of accuracy is 0.8446. This value is observed when the window size is 25 and the cutoff frequency is 1. This optimum combination of window size and cutoff frequency is specific to this data.

(1) The value of accuracy draws a curve with respect to the window size: initially, as the window size increases, the accuracy also increases, until it reaches an optimum value from where it starts to fall off again. This is due to the fact that as the window size increases, the number of noise words also increases, which reduces the overall accuracy of the classifier, because the more connected tokens occur fewer times than the stop words (conjunctions, determiners, etc.).

(2) As far as the relation between the cutoff frequency and the accuracy goes, it is observed that as the cutoff frequency increases, the accuracy goes on decreasing. This is because as the cutoff frequency increases, the more dependent words (the words more specific to a sense or context), which occur very few times, are eliminated from being considered as features. This leaves only the stop words (conjunctions, prepositions, determiners, etc.) as features. These stop words cannot train the classifier the way the specific dependent tokens do: they occur in almost all instances equi-probably, so it becomes more difficult to disambiguate the sense, causing the accuracy to drop.

In short, a midrange window size and a lower cutoff frequency would most probably give the optimum accuracy.
We also conclude that a window size of 0 gives the least accuracy. If we go on increasing the window size beyond 25, the accuracy will increase up to a certain window size, after which it will again start falling slowly. However, whatever the window size and the cutoff frequency, the accuracy won't fall below the accuracy observed when the window size is 0.

-----------------
References :
-----------------
(1) Programming Perl - O'Reilly Publications
(2) Foundations of Statistical Natural Language Processing - Christopher D. Manning and Hinrich Schutze
(3) Problem Definition and some related concepts and issues - Pedersen T. ( www.d.umn.edu/~tpederse )

**************************************************************************************

---------------------------------
Part ( II ) -- Finding Good Words
---------------------------------
A good word is one that has the following properties:

(1) Has a reasonable rate of agreement among the two taggers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After analyzing the Open Mind data from the file /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.details, the following things were observed:

(a) In all, 13204 instances were tagged. Of these, 7350 were unique instances.
(b) Among the 7350 unique instances:
	2089 instances were tagged by a single user
	4794 instances were tagged by two users
	 381 instances were tagged by three users
	  55 instances were tagged by four users
	  22 instances were tagged by five users
	   9 instances were tagged by six users

After observing the above figures, it was important to decide what the criterion for agreement between two users should be, since 2089 instances were tagged by only a single person. Should words that were tagged by just a single person be counted as being in agreement? To solve this problem, I did not count the words which were tagged by a single user as words in agreement. There were many words, like distance, chair, bar, authority, channel, objective, shoulder and art, that were tagged only by single users; these words were, however, not eliminated from being considered as good words. For the words which were tagged by two or more users, if two users agreed on some particular sense it was counted as agreement; i.e. for the instances which were tagged by 6 users, if two or more users agreed on some particular sense it was counted as agreement.

The agreement for the different words was found using a program called agreement.pl. While calculating the agreement, the program computes the agreement ratio over the instances where two or more users have tagged the same instance; it disregards all the instances that a single user has tagged (a sketch of this computation follows).
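A minimal sketch of this kind of agreement computation (an illustration only; the %tags structure, mapping instance id to a reference to the list of senses assigned by the taggers, is assumed to be filled in elsewhere):

    # over instances tagged by two or more users, count those where some
    # sense was assigned at least twice; single-user instances are skipped
    sub agreement_ratio {
        my ($tags) = @_;                      # hash ref: id -> [senses]
        my ($multi, $agree) = (0, 0);
        foreach my $id (keys %$tags) {
            my @senses = @{ $tags->{$id} };
            next if @senses < 2;              # single-user instances disregarded
            $multi++;
            my %freq;
            $freq{$_}++ foreach @senses;
            $agree++ if grep { $_ >= 2 } values %freq;
        }
        return $multi ? $agree / $multi : 0;
    }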
The case where a single user has tagged an instance could be considered either 100% agreement or 0% agreement; since the criterion for a GOOD word is agreement among two taggers, for instances tagged by a single user the agreement ratio simply cannot be decided. Because I have not counted the single-user instances as agreement, I get lower agreement ratios: as computed, they are around the 0.40 to 0.50 mark. However, if instances tagged by a single user are counted as agreement, the ratios rise considerably, to around the 0.80 mark.

After looking at the output of agreement.pl, we see that the following words can be classified as possible GOOD words (satisfying at least one of the three criteria): chapter, past, coffee, captain, team, edge, unit, song, tissue, sun, skill, college, blood, chance, structure, lip.

(2) Has a somewhat balanced distribution of senses. At the very least avoids the case where a single sense dominates the distribution.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The second criterion for finding a good word is that the sense distribution for the word should be fairly even, i.e. every sense of the word should contribute towards the total number of instances. We should not consider any word that has a distribution in which one sense dominates the other senses; such words were removed directly from consideration. Some of the words removed because of this property were:
(i)  lip   ---> 16, 1, 183, 5, 20
(ii) chair ---> 1, 16, 1

To find the sense distribution of each word, a perl program called sense_dist.pl was written and run on the Open Mind data. The program outputs each word and a count of how many times its different senses occur. After looking at the sense distributions of all the words, some of the words satisfying this criterion were: difficulty, distribution, memory, captain, shape, hope, argument, demand, edge, volume, aspect, art, behaviour, unit and length.

(3) Has a reasonable number of examples.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The third criterion for finding a GOOD word is that the word should have a reasonable number of instances, so the words having more instances should be considered as satisfying this criterion. A perl program, instances.pl, outputs each word and its count of instances. The words having a reasonable number of examples are as follows: pain, chapter, dust, attempt, future, past, coffee, atmosphere, arc, radiation, difficulty, song, enemy, tissue, depth, memory, captain, shape, team, onset, author, factor, argument, newspaper, phase, demand, edge, college, behaviour, blood, chance, structure.
-----------------------------------------------------------------------------------------------------
****************************************** GOOD WORDS ***********************************************
-----------------------------------------------------------------------------------------------------
After considering all three criteria for finding the good words, the following words were found to be amongst the better words in the cs8761-umd project. Here the agreement ratio also counts the instances tagged by single users as agreement.

-----------------------------------------------------------------------------------------------------------
word           number of instances   sense distribution                          agreement
-----------------------------------------------------------------------------------------------------------
difficulty           271             48, 80, 80, 63                               0.7214
distribution         291             45, 35, 2, 127, 74, 8                        0.7783
memory               287             16, 25, 81, 2, 104, 59                       0.6394
captain              366             55, 27, 40, 21, 57, 40, 1, 11, 114           0.8196
shape                241             77, 6, 24, 42, 18, 24, 20, 30                0.6784
hope                 227             21, 53, 1, 4, 6, 91, 51                      0.8366
argument             269             3, 111, 63, 7, 1, 84                         0.7342
demand               263             8, 57, 37, 100, 1, 60                        0.6102
edge                 188             19, 38, 30, 34, 4, 54, 9                     0.6273
volume               331             42, 12, 9, 41, 43, 136, 42, 6                0.7284
aspect               255             46, 1, 10, 121, 65, 12                       0.5922
behaviour            270             35, 1, 63, 1, 60, 110                        0.6962
length               195             34, 36, 47, 22, 53, 3                        0.6462
art                  110             31, 19, 35, 25                               1.0000
-----------------------------------------------------------------------------------------------------------

The word "art" has an agreement of 1.0000 because only one user tagged all of its instances.

After going through the outputs of all three programs (instances.pl, agreement.pl and sense_dist.pl), I came up with the following six GOOD words: (1) art (2) shape (3) captain (4) edge (5) length (6) distribution. These six words satisfied all three criteria: they were above the threshold values. The thresholds can be set as per the user's choice; in this case the conditions I considered were a number of instances above 100, a good sense distribution (i.e. words having more than 4 fairly evenly distributed senses), and a good agreement ratio (here taken to be about 0.6 or more). The word art was tagged by only one user, but it has a good sense distribution, so it is considered a GOOD word.

-----------------------------------------------------------
Running The Naive Bayesian Classifier on the Open-Mind data
-----------------------------------------------------------
After the Open Mind data was converted into the line format, the Naive Bayesian Classifier was executed on it.

Complete procedure for finding the accuracy after a word has been identified as a GOOD word:
------------------------------------------------------------------------------------
(1) Say the word "art" has been identified as a GOOD word.
(2) Run pickup.pl as follows:

    pickup.pl art

This command separates out all the instances of art in the Open Mind data, using the ids-to-sentences file. The format of pickup.pl's output is:

    instance_ID sense actual_instance

(3) Care should be taken that if older files for the same senses exist, they are deleted first, because I open the files in append mode. Execute seperate.pl as follows:

    seperate.pl art

It separates all the instances of "art" into different files according to the senses of the instances. Each file name is created from the sense id of the sense whose instances are stored in that file.
The Naive Bayesian Classifier was run on all six words for every combination of window sizes 0, 2, 10 and 25 and frequency cutoffs 1, 2 and 5. The following tables show the observations.

Word : edge
-----------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.22
      0       |          2         |   0.22
      0       |          5         |   0.22
------------------------------------------------------------
      2       |          1         |   0.29
      2       |          2         |   0.28
      2       |          5         |   0.25
------------------------------------------------------------
     10       |          1         |   0.42
     10       |          2         |   0.36
     10       |          5         |   0.29
------------------------------------------------------------
     25       |          1         |   0.52
     25       |          2         |   0.49
     25       |          5         |   0.49
------------------------------------------------------------

Word : length
-------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.25
      0       |          2         |   0.25
      0       |          5         |   0.25
------------------------------------------------------------
      2       |          1         |   0.28
      2       |          2         |   0.28
      2       |          5         |   0.27
------------------------------------------------------------
     10       |          1         |   0.36
     10       |          2         |   0.38
     10       |          5         |   0.29
------------------------------------------------------------
     25       |          1         |   0.45
     25       |          2         |   0.44
     25       |          5         |   0.39
------------------------------------------------------------

Word : distribution
-------------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.52
      0       |          2         |   0.52
      0       |          5         |   0.52
------------------------------------------------------------
      2       |          1         |   0.57
      2       |          2         |   0.59
      2       |          5         |   0.56
------------------------------------------------------------
     10       |          1         |   0.59
     10       |          2         |   0.59
     10       |          5         |   0.60
------------------------------------------------------------
     25       |          1         |   0.56
     25       |          2         |   0.63
     25       |          5         |   0.58
------------------------------------------------------------

Word : art
----------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.36
      0       |          2         |   0.36
      0       |          5         |   0.36
------------------------------------------------------------
      2       |          1         |   0.39
      2       |          2         |   0.39
      2       |          5         |   0.36
------------------------------------------------------------
     10       |          1         |   0.48
     10       |          2         |   0.49
     10       |          5         |   0.45
------------------------------------------------------------
     25       |          1         |   0.53
     25       |          2         |   0.29  ----> looks like a wrong value
     25       |          5         |   0.50
------------------------------------------------------------

Word : shape
------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.34
      0       |          2         |   0.34
      0       |          5         |   0.34
------------------------------------------------------------
      2       |          1         |   0.41
      2       |          2         |   0.39
      2       |          5         |   0.38
------------------------------------------------------------
     10       |          1         |   0.48
     10       |          2         |   0.48
     10       |          5         |   0.44
------------------------------------------------------------
     25       |          1         |   0.57
     25       |          2         |   0.53
     25       |          5         |   0.46
------------------------------------------------------------

Word : captain
--------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.31
      0       |          2         |   0.31
      0       |          5         |   0.31
------------------------------------------------------------
      2       |          1         |   0.43
      2       |          2         |   0.43
      2       |          5         |   0.41
------------------------------------------------------------
     10       |          1         |   0.41
     10       |          2         |   0.40
     10       |          5         |   0.44
------------------------------------------------------------
     25       |          1         |   0.55
     25       |          2         |   0.55
     25       |          5         |   0.54
------------------------------------------------------------
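At window size 0 the accuracy is identical for every frequency cutoff, which is consistent with the classifier backing off to the most common sense when no contextual features are available. That floor can be sanity-checked against the most-frequent-sense proportion implied by the sense distributions in the GOOD WORDS table; the numbers differ somewhat because the reported accuracies are measured on the 30% TEST split rather than on all instances. A minimal sketch (baseline.pl is a hypothetical helper, not one of the submitted scripts):

        #!/usr/bin/perl
        # baseline.pl - hypothetical helper: most-frequent-sense proportion
        # computed from the GOOD WORDS sense distributions above.
        use strict;
        use warnings;
        use List::Util qw(max sum);

        my %dist = (
            edge         => [19, 38, 30, 34, 4, 54, 9],
            distribution => [45, 35, 2, 127, 74, 8],
        );

        for my $word (sort keys %dist) {
            printf "%-13s most-frequent-sense proportion = %.2f\n",
                $word, max(@{ $dist{$word} }) / sum(@{ $dist{$word} });
        }

This prints 0.44 for distribution and 0.29 for edge, in the same range as the observed window-0 accuracies (0.52 and 0.22 respectively).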
From the above observations I came to the following conclusions:

(1) The conclusions drawn from the observation and analysis of the line data appear to hold for the Open-Mind data too: as the window size increases the accuracy increases, and as the frequency cutoff increases the accuracy tends to decrease.

(2) However, as observed above, the accuracy is much lower than for the line data. Possible reasons are wrong tagging by the Open-Mind users, instances tagged with an unclear sense or with no sense at all, and the fact that the number of instances per word, over all its senses, is very small compared to the line data, which was one large body of data for a single word.

(3) All the GOOD words behaved much as expected: accuracy increases as the window size increases, decreases as the frequency cutoff increases, and never falls below a certain floor, namely the accuracy at window size 0.

(4) Thus the classifier behaves as expected, and the Naive Bayesian Classifier can be considered one of the good supervised approaches to word sense disambiguation.

(5) The words selected were only from the cs8761 project. If the entire Open-Mind data had been considered, we could certainly have found better GOOD words, better agreement ratios and still better accuracies.

--------------
References :
--------------
1. For the Naive Bayesian Classifier, refer to the report for Assignment 3.
2. Problem definition and related concepts: Pedersen, T. (www.d.umn.edu/~tpederse)
3. Open Mind data (www.teach-computers.org)

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

#********************************************************************
# Name: Nan Zhang
# Date: 11/10/02
# Class: CS8761
#********************************************************************

------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.5466
      0       |          2         |   0.5466
      0       |          5         |   0.5466
      2       |          1         |   0.7363
      2       |          2         |   0.7178
      2       |          5         |   0.6905
     10       |          1         |   0.8207
     10       |          2         |   0.8127
     10       |          5         |   0.7830
     25       |          1         |   0.8344
     25       |          2         |   0.8280
     25       |          5         |   0.7910
------------------------------------------------------------

Basically, the accuracies follow a consistent pattern: as window size increases, accuracy increases, and as frequency cutoff increases, accuracy decreases. So with window size 25 and frequency cutoff 1 I get the maximal accuracy, 0.8344.
However, it is not markedly larger than the other accuracies at window size 25.