++++++++++++++++++++++++++++++++++++++++++

==================================================
Author: Amine Abou-Rjeili
Date: 10/30/2002
Filename: experiments.txt
Class: CS8761
Assignment: 3 - How to Find Meanings in Li(f|n)e
==================================================

Included files:
---------------

select.pl       Randomly divides the data into TRAIN and TEST partitions
feat.pl         Extracts features to be used for training and testing.
                Features must be extracted from the TRAIN partition and
                NEVER from the TEST partition
convert.pl      Converts instances into feature vector representation.
                Used to convert the TRAIN and TEST instances into a set
                of feature vectors
nb.pl           Runs a Naive Bayes Classifier to learn from TRAIN and
                then classify the instances from TEST according to the
                learned model
target.config   Contains system configurations
experiments.txt Results and analysis of experiments (this file)

Analysis:
---------

This file contains the experiments and analysis that were carried out to
test the Naive Bayes Classifier on the line data. The experiments were
run using a 70-30 percent ratio to split the data into TRAIN and TEST
partitions respectively. The following combinations of window sizes and
frequency cutoffs were used:

Window Size:        0 2 10 25
Frequency cutoffs:  1 2 5

I ran two sets of experiments as follows:

1) Split the data into TRAIN and TEST partitions once and then run the
   experiments using these partitions. This experiment was carried out
   to compare the performance of the different combinations using the
   same TRAIN and TEST data (constant environment). The results of these
   experiments are summarized in TABLE 1.

2) For each combination of window size and frequency cutoff, a different
   set of TRAIN and TEST partitions was used. These partitions differ in
   every run because they are randomly drawn from the entire line data.
   This experiment was carried out to see how different partitions can
   affect the performance of each experiment.

As can be observed from the data below, the performances of the two
experiments are very similar, so the changing environments did not have
a great impact. However, performance degraded slightly with different
partitions (in experiment 2). For example, consider the following:

              Window Size | Frequency cutoff | Accuracy
Experiment 1)     10              5            Accuracy: 0.8232 [Total number of correct 1024 out of 1244]
Experiment 2)     10              5            Accuracy: 0.8063 [Total number of correct 1003 out of 1244]

Experiment 1 has a better accuracy by about 2%. One explanation is that
the TRAIN data happened to be more reflective of the TEST data in case 1
than in case 2. Here, I must note that these two sets of experiments do
not necessarily imply that having a constant environment is always
better. It seems to be a matter of finding a set of TRAINING data that
is diverse enough to reflect, with the most accuracy, any set of test
data. The results of experiment 2 are summarized in TABLE 2 below.
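Before the tables, here is a minimal sketch of the random splitting step
that both experiments depend on. This is an illustrative stand-in, not
the actual select.pl: it assumes one already sense-tagged instance per
line on standard input, and it skips the tagging that select.pl performs
via target.config.

  #!/usr/bin/perl -w
  use strict;

  my $train_pct = 70;                 # percentage of instances for TRAIN

  my @instances = <STDIN>;            # one tagged instance per line

  # Fisher-Yates shuffle so the split is random
  for (my $i = $#instances; $i > 0; $i--) {
      my $j = int(rand($i + 1));
      @instances[$i, $j] = @instances[$j, $i];
  }

  my $cut = int(@instances * $train_pct / 100);

  open(TRAIN, '>', 'TRAIN') or die "cannot write TRAIN: $!";
  open(TEST,  '>', 'TEST')  or die "cannot write TEST: $!";
  print TRAIN @instances[0 .. $cut - 1];
  print TEST  @instances[$cut .. $#instances];
  close TRAIN;
  close TEST;

Because the shuffle is random, two runs of this step produce different
partitions, which is exactly the difference between experiment 1 (one
fixed split) and experiment 2 (a fresh split per run).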
TABLE 1
-------

window size | frequency cutoff | accuracy
     0              1            Accuracy: 0.5273 [Total number of correct  656 out of 1244]
     0              2            Accuracy: 0.5273 [Total number of correct  656 out of 1244]
     0              5            Accuracy: 0.5273 [Total number of correct  656 out of 1244]
     2              1            Accuracy: 0.7508 [Total number of correct  934 out of 1244]
     2              2            Accuracy: 0.7564 [Total number of correct  941 out of 1244]
     2              5            Accuracy: 0.7412 [Total number of correct  922 out of 1244]
    10              1            Accuracy: 0.7460 [Total number of correct  928 out of 1244]
    10              2            Accuracy: 0.8135 [Total number of correct 1012 out of 1244]
    10              5            Accuracy: 0.8232 [Total number of correct 1024 out of 1244]
    25              1            Accuracy: 0.7572 [Total number of correct  942 out of 1244]
    25              2            Accuracy: 0.8400 [Total number of correct 1045 out of 1244]
    25              5            Accuracy: 0.8344 [Total number of correct 1038 out of 1244]
---------------------------------------------------------

TABLE 2
-------
Running experiments with different TRAIN and TEST each time

window size | frequency cutoff | accuracy
     0              1            Accuracy: 0.5338 [Total number of correct  664 out of 1244]
     0              2            Accuracy: 0.5579 [Total number of correct  694 out of 1244]
     0              5            Accuracy: 0.5346 [Total number of correct  665 out of 1244]
     2              1            Accuracy: 0.7291 [Total number of correct  907 out of 1244]
     2              2            Accuracy: 0.7347 [Total number of correct  914 out of 1244]
     2              5            Accuracy: 0.7235 [Total number of correct  900 out of 1244]
    10              1            Accuracy: 0.7195 [Total number of correct  895 out of 1244]
    10              2            Accuracy: 0.7910 [Total number of correct  984 out of 1244]
    10              5            Accuracy: 0.8063 [Total number of correct 1003 out of 1244]
    25              1            Accuracy: 0.7170 [Total number of correct  892 out of 1244]
    25              2            Accuracy: 0.8103 [Total number of correct 1008 out of 1244]
    25              5            Accuracy: 0.8119 [Total number of correct 1010 out of 1244]

From the above data we can observe that window size alone does not
generally improve accuracy; the effect also depends on the frequency
cutoff. For example, taking the following instances:

a) 10 5 Accuracy: 0.8232 [Total number of correct 1024 out of 1244]
b) 25 1 Accuracy: 0.7572 [Total number of correct  942 out of 1244]

we can see that the accuracy actually decreased when we increased the
window size but decreased the frequency cutoff. The exception is a
window size of 0, in which the frequency cutoff is not used since there
are no features; increasing the window size from 0 to 2 gives an
increase in performance irrespective of the frequency cutoff. This
general rule makes sense: if we have a small frequency cutoff, then we
will have more features which just happen to be there by chance and do
not contribute much to the sense of the target word. If we have a higher
frequency cutoff, then we keep the words that occur more frequently
around the target word in the context, since the ones that occur
infrequently are not considered. On the other hand, if we set a very
high frequency cutoff then we might get fewer features in the vector,
which is similar to falling back to a smaller window size. This can be
seen from the following example taken from TABLE 1:

25 2 Accuracy: 0.8400 [Total number of correct 1045 out of 1244]
25 5 Accuracy: 0.8344 [Total number of correct 1038 out of 1244]

We can see here that a frequency cutoff of 2 performed slightly better
than a cutoff of 5 (with window size 25). This is not the case for
TABLE 2, due to the different TRAIN and TEST partitions. So a model with
a smaller frequency cutoff could generalize better than one with a
cutoff of 5, depending on the TRAIN data.
From the above experiments, we can see that the optimal combination of
window size and frequency cutoff is

Window Size: 25
Frequency Cutoff: 2
Accuracy: 84%

This is the highest accuracy achieved, although the 25/5 combination
performed only slightly worse, at 83.4%. This can be explained by the
discussion above: we do not want to go too big in terms of window size
or too high in terms of frequency cutoff, because then we get into the
case of overfitting the model. So we need to find a middle ground for
best performance. In the experiments carried out, the combination 25/2
turned out to be optimal, as reported in TABLE 1. In contrast, the
results from TABLE 2 disagree and place the combination 25/5 slightly
above 25/2. This is attributed to the different TRAIN and TEST
partitions, as mentioned above.

In conclusion, it can be observed from the data that for each window
size, the accuracy starts low and then increases as the frequency cutoff
increases, although in some cases the accuracy drops as the cutoff
increases from 2 to 5.

REFERENCES:
-----------
Manning and Schutze. 2000. Foundations of Statistical Natural Language
Processing: 235-239.
Mitchell. 1997. Machine Learning: ch. 6.

++++++++++++++++++++++++++++++++++++++++++

===============================================================
Name: Nitin Agarwal
Date: 1-Nov-02
Class: Natural Language Processing CS8761
===============================================================

Objective

To create sense tagged text and explore supervised approaches to word
sense disambiguation.

Procedure

Consider a large corpus in which one word occurs in every line and can
have various meanings. We split this corpus into two sets. We analyze
one set using statistical natural language processing techniques to get
the training data. Using this data we try to figure out the meaning of
the target word in the test data. We then compare the meaning of the
word obtained using this method with the actual meaning and compute the
accuracy of the method. The accuracy can be anywhere between 0 and 1;
the closer the value is to 1, the better the method.

Observations

I ran the test for all 12 possible combinations of window size and
frequency cutoff. Unfortunately I got the same accuracy, 0.2370, in all
12 cases. This suggests that my program is flawed somewhere, most likely
in nb.pl, though I could not locate the mistake. From the output I
noticed that the probability for many instances was 0. The likely reason
is that I simply multiplied all the terms together, so the product
underflowed to zero. Had I added the logarithms of all the terms and
then taken the antilog of the sum, I would have gotten a different
value. However, I could not find an antilog function in Perl, because of
which I could not calculate the probability using this method.
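For the record, Perl's built-in exp() is the antilog of its natural
log(), and the comparison between senses can be done entirely in log
space without ever converting back. A minimal sketch of the idea, with
hypothetical data structures rather than Nitin's actual nb.pl: %prior is
assumed to hold P(sense) and %prob the smoothed P(feature|sense), all
nonzero thanks to smoothing.

  # pick the sense with the highest log score for one test instance
  sub best_sense {
      my ($prior, $prob, @features) = @_;
      my ($best, $best_log);
      for my $sense (keys %$prior) {
          # log of the prior, plus one log term per feature: adding
          # logs replaces multiplying probabilities, so nothing
          # underflows to zero
          my $score = log($prior->{$sense});
          $score += log($prob->{$sense}{$_}) for @features;
          if (!defined($best) || $score > $best_log) {
              ($best, $best_log) = ($sense, $score);
          }
      }
      # exp() is the natural antilog; it is only needed if the
      # probability itself must be displayed
      return ($best, exp($best_log));
  }

Because the argmax over senses is the same in log space as in
probability space, the antilog is never required for classification
itself.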
window size | frequency cutoff | accuracy
     0              1             0.2370
     0              2             0.2370
     0              5             0.2370
     .              .               .
     .              .               .
     .              .               .
    25              5             0.2370

++++++++++++++++++++++++++++++++++++++++++

Kailash Aurangabadkar
Assignment # 3
How to find meanings in li(n|f)e
--------------------------------------------------------------------------------
The objective of the assignment is to explore supervised approaches to
word sense disambiguation. In this assignment a sense-tagged text is
created and a Naive Bayesian classifier is implemented to perform word
sense disambiguation.
--------------------------------------------------------------------------------
Word Sense Disambiguation:

The task of disambiguation is to determine which of the senses of an
ambiguous word is invoked in a particular use of the word.
--------------------------------------------------------------------------------
Process:

In this assignment the Naive Bayesian classification algorithm is used
to assign a score to every instance in the Test data for every possible
sense of the word. The central idea of this classifier is to look at the
words around the ambiguous word in a large context window. Each content
word adds information for the disambiguation of the target word. We
first find the probability vector from the Train data for each sense and
feature combination. If we did not see a word in that context, we apply
Witten-Bell smoothing to find the probability of that event. Then we
find the probability of a content word for every sense in the Test data
window by using the Naive Bayes assumption. The sense with the maximum
probability for an instance from the Test data is assigned as the sense
of that instance. The accuracy of the algorithm is then computed by
comparing the actual sense of the word in each line with the sense we
assigned.
--------------------------------------------------------------------------------
The assignment consists of implementing four programs (a sketch of the
feature-extraction step follows this list):

1. select.pl: This program divides a sense-tagged corpus of text into
   Training data and Test data.
2. feat.pl: This program finds the features in the specified window
   around the target word. It also checks that the frequency of each
   feature is more than the specified cutoff.
3. convert.pl: This program gives us the feature vector table, which
   shows whether or not the features obtained using feat.pl are present
   in the input file.
4. nb.pl: This program assigns sense tags to untagged data from the Test
   set. It does this by using the Naive Bayesian algorithm, smoothing
   the values with Witten-Bell smoothing.
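The feature-extraction step (item 2 above) can be sketched as follows.
This is an illustrative stand-in rather than the actual feat.pl; in
particular, the target-word pattern below is a placeholder for what the
real program reads from target.config.

  use strict;

  my ($w, $f) = (2, 1);            # window size, frequency cutoff
  my %count;

  while (my $line = <STDIN>) {     # one TRAIN instance per line
      my @tok = split ' ', $line;
      for my $i (0 .. $#tok) {
          next unless $tok[$i] =~ /^lines?$/;      # placeholder target test
          my $lo = $i - $w < 0     ? 0     : $i - $w;
          my $hi = $i + $w > $#tok ? $#tok : $i + $w;
          for my $j ($lo .. $hi) {
              next if $j == $i;                    # the target is not a feature
              $count{$tok[$j]}++;
          }
      }
  }

  for my $type (sort keys %count) {
      print "$type\n" if $count{$type} > $f;       # keep types seen more than f times
  }

The two command-line parameters of the real feat.pl correspond directly
to $w and $f here: widening $w admits more word types, raising $f
discards the rarer ones.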
--------------------------------------------------------------------------------
Experiment Results:

The accuracy values for each combination of window size and frequency
cutoff are shown below:
--------------------------------------------------------------------------------
Window Size     Frequency Cutoff        Accuracy value
--------------------------------------------------------------------------------
 0                      1                0.4674
 0                      2                0.4674
 0                      5                0.4674
--------------------------------------------------------------------------------
 2                      1                0.7383
 2                      2                0.7262
 2                      5                0.7021
--------------------------------------------------------------------------------
10                      1                0.7412
10                      2                0.7353
10                      5                0.7124
--------------------------------------------------------------------------------
25                      1                0.793
25                      2                0.7414
25                      5                0.7565
--------------------------------------------------------------------------------

We see in general that as the window size increases, the accuracy value
increases. This is to be expected: as the window size increases, we
capture more content words around the ambiguous word under
consideration, which gives us more and more information about the
particular sense occurring in that instance.

We also see that the accuracy value decreases as the frequency cutoff is
increased. This follows from the fact that a word is surrounded by fewer
key words than by common words such as verbs, articles and prepositions.
When we say "telephone line" we know that line here means "cord", but
"telephone" is less frequent around "line" (meaning "cord") than are
words like "the", "is", etc. Thus as we increase the frequency cutoff we
risk losing the keywords and retaining the less important ones. This
trend generally holds in the observations summarized in the table above.

Thus we see that by making the window size as large as possible and the
frequency cutoff as low as possible (possibly 1) we can get higher and
higher accuracy in assigning senses to ambiguous words.

++++++++++++++++++++++++++++++++++++++++++

cs8761 Natural Language Processing
Assignment 3
Archana Bellamkonda
November 1, 2002.

Problem Description: (for part II - Naive Bayesian Classifier Implementation)
-------------------

The objective is to assign a sense to a given target word in a sentence
using a Naive Bayesian Classifier. We start off with sets of data where
we have previously collected instances in a particular sense. In this
assignment, we do the sense tagging in four steps (a sketch of the
conversion step follows this list):

1) select.pl: This program collects all instances from the given data,
   randomizes the instances after adding information about the sense
   from which they were extracted, and divides these instances into two
   sets of data, TEST and TRAIN, depending on the percentage entered by
   the user.

2) feat.pl: This program identifies all word types that occur within "w"
   positions to the left or right of the target word, and that occur
   more than "f" times in the TRAIN data set. It does not include the
   target word as a feature.

3) convert.pl: This program converts the input file to a feature vector
   representation, where the features output by feat.pl are read from
   standard input. Each instance in the file is converted into a series
   of binary values that indicate whether or not each type listed in the
   feature list output by feat.pl has occurred within the specified
   window around the target word in the given instance. NOTE: It also
   includes the number of unobserved features in the specified window
   around the target word for every instance. This is the last number in
   the feature vector.

4) nb.pl: This program learns a Naive Bayesian classifier and uses that
   classifier to assign sense tags to the test data; the senses are
   printed along with instance ids, the actual sense, and the
   probability of the assigned sense.
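The conversion in step 3 reduces each instance to a row of bits plus a
trailing count of unobserved words. A minimal sketch of that reduction
with illustrative names (to_vector is not the actual convert.pl):

  use strict;

  # $features: ordered feature list from feat.pl (array ref)
  # $in_window: word types seen inside this instance's window (hash ref)
  sub to_vector {
      my ($features, $in_window) = @_;
      my %known = map { $_ => 1 } @$features;
      my @bits  = map { $in_window->{$_} ? 1 : 0 } @$features;
      # count of window words matching no known feature; per the NOTE
      # above, this is appended as the last number of each vector
      my $unobserved = grep { !$known{$_} } keys %$in_window;
      return join(' ', @bits, $unobserved);
  }

  my @feats = qw(telephone cut wait);
  my %win   = (telephone => 1, the => 1, to => 1);
  print to_vector(\@feats, \%win), "\n";    # prints "1 0 0 2"

In the usage example, "telephone" is the only known feature in the
window, and "the" and "to" account for the trailing 2 unobserved words.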
Experiment:
----------

                     Frequency cutoff
                 1         2         5
windowsize |__________________________________________
     0     |   0.4844    0.4711    0.4711
     2     |   0.8667    0.5378    0.4711
    10     |   0.9333    0.4933    0.4711
    25     |   0.9111    0.8978    0.5822

Experiments were done with the window sizes and frequency cutoffs shown
above. The results are as shown for all twelve combinations.

Observations:
------------

--> Expected:
    As the window size increases, the number of features that we observe
    increases, and hence we will learn more about the context of a given
    word; so we would expect higher accuracy as the window size
    increases (taking the frequency cutoff to be 1). But the meaning of
    a word depends on the surrounding context only. For example, the
    meaning of a word in a sentence will generally not depend on the
    meaning of words in other sentences. So, if we go on increasing the
    window size, starting from zero, we will observe an increase in
    accuracy up to a certain level, and then, when the window size
    increases beyond the required context, accuracy will drop again.

    Observed:
    As shown in the table above, we observed what we expected. Consider
    the column for frequency cutoff 1. As window size increased,
    accuracy increased, reached a maximum of 0.9333, and then dropped
    again to 0.9111.

--> Expected:
    According to Zipf's law, most of the features are not repeated in a
    text. So, as the frequency cutoff increases for a particular window
    size, fewer features will be observed and hence we cannot capture
    the context properly. Thus, accuracy decreases as the frequency
    cutoff increases for a given window size.

    Observed:
    As shown in the table above, we again observed what we expected.
    Consider any row for a particular window size. The accuracy values
    decrease as the frequency cutoff increases.

--> Expected:
    When we consider the case where frequency cutoffs (other than 1) are
    increasing, we also estimate that more features will be observed as
    the window size increases, and thus we can expect accuracy to be
    higher.

    Observed:
    Consider the columns for frequency cutoffs 2 and 5. We observed what
    we expected.

CONCLUSIONS
----------

Noise:
-----
We should select the window size to be optimal. We should observe where
we start getting noise, i.e., the point at which we start observing
unwanted features in our context, and limit our window size to be lower
than that cutoff.

Optimal Combination:
-------------------
The optimal combination will be the greatest window size below the point
where we start observing noise, together with a frequency cutoff of "1",
since we would then observe more features than with higher frequency
cutoffs and hence would learn the context of a sense in a better way.
Optimal Combination In Our Experiment:
-------------------------------------

As seen from the table above, the optimal combination is predicted to be
a window size of "10" and a frequency cutoff of "1". We got the highest
accuracy at that point, 0.9333, as shown.

++++++++++++++++++++++++++++++++++++++++++

#------------------------------------------------------------
# Assignment #3 : CS-8761 Corpus based NLP
# Title    : Word Sense Disambiguation [experiments.txt]
# Author   : Deodatta Bhoite
# Version  : 1.0
# Date     : 10-31-2002
#------------------------------------------------------------

Introduction
------------
In this assignment we have implemented the Naive Bayesian classifier for
word sense disambiguation. The data used is the 'line' data (all six
senses). The classifier learns from a part of the data and is then
tested on the remaining data, assigning a sense to the ambiguous word
based on the probabilities it has learned from the training data. We
also perform Witten-Bell smoothing to assign a non-zero probability to
unobserved events.

Results of Experiment #1 (all six senses)
--------------------------------------------
The results of the experiments are summarized in the following table.

Window size     frequency cutoff        accuracy
 0                      1               0.5340
 0                      2               0.5340
 0                      5               0.5340
 2                      1               0.1692
 2                      2               0.0946
 2                      5               0.0858
10                      1               0.1358
10                      2               0.1363
10                      5               0.1226
25                      1               0.1431
25                      2               0.1897
25                      5               0.1796

Observations
------------
We observe that the accuracy is far better when the window size is zero,
i.e., when we assign the most frequently occurring sense to all the
instances. Though it has an upper hand in these results, we acknowledge
that it depends on the probability of the most frequent sense in the
corpus, which is not necessarily always very high. (For example, with 99
senses where s1=0.02 and s2 through s99=0.01 each, the accuracy of this
approach will only be 0.02.) Thus, though it gives a higher accuracy
than the other cases in this instance, it is not a good approach.
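The window-size-0 case amounts to a majority-class baseline, which can
be sketched in a few lines of Perl (illustrative code, not the actual
nb.pl):

  use strict;

  # $train and $test are references to arrays of sense tags
  sub majority_baseline {
      my ($train, $test) = @_;
      my %freq;
      $freq{$_}++ for @$train;
      # the sense seen most often in TRAIN...
      my ($majority) = sort { $freq{$b} <=> $freq{$a} } keys %freq;
      # ...is assigned to every TEST instance; scalar grep counts hits
      my $correct = grep { $_ eq $majority } @$test;
      return ($majority, $correct / @$test);
  }

On the hypothetical 99-sense distribution above, this baseline would
score only about 0.02, since the majority sense covers just 2% of the
test data.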
Among the other window sizes and frequency cutoffs, the window size of
25 with a frequency cutoff of 2 gives us the highest accuracy. We can
justify the high accuracy by the increase in window size. The decrease
in accuracy at frequency cutoffs 5 and 1 probably suggests that the
words providing strong evidence for a sense have a frequency greater
than or equal to 2 but less than 5. The decrease in accuracy at a
frequency cutoff of 1 suggests that words occurring only once 'mislead'
the classifier by pointing towards a wrong sense.

We also see that, in general, the accuracy improves with an increase in
window size (with the exceptions of W=2, F=1 and W=0). There does not
seem to be a direct relation between the frequency cutoff and the
accuracy.

Experiment #2 (cord2 & division2)
--------------------------------
I also used the classifier for classifying two senses. The results for
cord2 and division2 as the two senses of 'line' are:

Window size     frequency cutoff        accuracy
 0                      1               0.5022
 0                      2               0.5022
 0                      5               0.5022
 2                      1               0.5200
 2                      2               0.5067
 2                      5               0.5200
10                      1               0.7511
10                      2               0.6400
10                      5               0.6667
25                      1               0.8133
25                      2               0.8222
25                      5               0.8133

We observe that the accuracy is now increasing with the window size. The
frequency cutoff does not seem to matter much. Interestingly, the
maximum accuracy is again at window size 25 and frequency cutoff 2. The
overall accuracy of the classifier has also increased. Next we will see
how the classifier does as the number of senses increases.

Experiment #3 (cord2, division2 and formation2)
---------------------------------------------
The results of the classifier for cord2, division2 and formation2 as
three different senses of 'line' are as follows:

Window size     frequency cutoff        accuracy
 0                      1               0.3424
 0                      2               0.3424
 0                      5               0.3424
 2                      1               0.3393
 2                      2               0.3333
 2                      5               0.3212
10                      1               0.5485
10                      2               0.3909
10                      5               0.4121
25                      1               0.6667
25                      2               0.5636
25                      5               0.4394

We observe that the pattern of the output is not the same as for 2
senses. We see that the accuracy for window size 0 is higher than for
window size 2; this will happen as we increase the number of senses.
However, the average accuracy does increase as the window size
increases, and the highest value is observed at window size 25. Also
note that the overall accuracy of the classifier has decreased as the
number of senses has increased.

Conclusion
----------
The classifier performs better when the number of senses is smaller. The
accuracy of the classifier increases as the window size increases.

++++++++++++++++++++++++++++++++++++++++++

Naive Bayesian Classifier
Bridget Thomson McInnes
31 October, 2002
CS8761

----------------------------------------------------------------------------
EXPERIMENTS:
----------------------------------------------------------------------------

 | Window Size  | Frequency Cutoffs    | Accuracy     |
 |--------------|----------------------|--------------|
 | 0            | 1                    | 0.5386       |
 | 0            | 2                    | 0.5346       |
 | 0            | 5                    | 0.5471       |
 | 2            | 1                    | 0.7299       |
 | 2            | 2                    | 0.7122       |
 | 2            | 5                    | 0.6479       |
 | 10           | 1                    | 0.7627       |
 | 10           | 2                    | 0.8142       |
 | 10           | 5                    | 0.7805       |
 | 25           | 1                    | 0.7620       |
 | 25           | 2                    | 0.7625       |
 | 25           | 5                    | 0.7203       |
 |----------------------------------------------------|

ANALYSIS:
----------------------------------------------------------------------------

The accuracy for a window size of zero is approximately the same for
each of the runs, roughly 50%. This is due to the fact that there are no
features for the classifier to train from. The classifier picks the most
frequent sense of the instances in the training data and applies this
sense to each instance in the test data. Given this, it might be thought
that the accuracy for a window size of zero should be exactly the same
no matter what the frequency cutoff is.
This is not the case, however, because the training and test instances
are randomly chosen each time the program is run. Therefore, the number
of instances for each tag in the test and training files varies with
each run of the program.

The accuracy for a window size of two is definitely higher than the
accuracy for a window size of zero. This is as expected, because with a
window size of two the classifier is not simply picking the most
frequent sense for every instance; it is using the features from the
training data to determine the sense of the instances in the test data.
The run made with a frequency cutoff of 5 is lower than the runs with
frequency cutoffs of one and two. I believe this is because with a
frequency cutoff of five, relevant features are not being included: the
chance of a relevant feature occurring 5 times is smaller than of it
occurring once or twice.

The accuracy for a window size of ten is higher than the accuracy for a
window size of one or two. I believe this is because more relevant
features can be captured with a larger window size. There is a point,
though, at which a window size can be too large, making it difficult to
assign the appropriate sense to the target word; this is discussed in
the next paragraph. With a window size of ten and a frequency cutoff of
five, the accuracy is lower than with a cutoff of one or two. As stated
above, I believe this is because with a frequency cutoff of five,
relevant features are not being included.

The accuracy for a window size of 25 is lower than the accuracy for a
window size of ten. I believe this is because the words close to the
left and right of the target word are indicative of which tag
corresponds to the instance. When a larger window size is employed, the
words to the left and right of the target word become vague and less
indicative. This creates noise, which makes it more difficult to
determine the sense of the instance. The accuracy for a window size of
25 is, however, higher than the accuracy for a window size of one or
two: as said above, with a larger window size the corresponding features
become too vague and create noise, while a smaller window size does not
create enough features to uniquely determine the tag of an instance.

++++++++++++++++++++++++++++++++++++++++++

# Suchitra Goopy
# Natural Language Processing
# 11/01/2002

The following data was used for this experiment:

cord2
phone2

---------------------------------------------------------------------------
Report
------

The main aim of this experiment was to perform "Word Sense
Disambiguation". Many words have the same form but different meanings,
so given a word, one cannot accurately tell its meaning just by looking
at the word. On the other hand, if the whole sentence is given then we
can guess the meaning of the word. For example, consider the word
"bank". This word has many meanings and cannot be disambiguated by just
looking at it:

I went to the bank to get money.
I sat by the river bank.

Now it is easy to see which sense of the word bank is being used, given
the whole sentence. So a word is disambiguated mainly by the surrounding
words in the sentence.
The idea of this experiment is the same: we look at the surrounding
words and build up an idea of the sense of the word. We then take a new
set of data and assign "sense-tags" to the instances depending on their
surrounding words. There were four phases that had to be implemented
during the course of this experiment.

Phase 1: Take a few files and randomize the instances in them. An
instance can consist of two or more sentences. We then put A percent of
them in a file called TRAIN and the remaining sentences in a file called
TEST.

Phase 2: We then take the TRAIN file and pick out words that occur
within a given window around the target word. For example, if a window
of 2 is specified, then for the following sentence

I went to the bank to get money

the window words are to, the, to, get (considering that the target word
here is "bank"). Then we check whether the window words occur above a
certain specified frequency. If they do, they are selected as the
features.

Phase 3: We then convert the features obtained in Phase 2 into a feature
vector representation using both the TRAIN and TEST files. We are
careful not to let features be derived from the TEST file.

Phase 4: Implement the Naive Bayesian Classifier.

The experiments were performed with different combinations of window
sizes and frequencies. The window sizes, the frequency values and the
obtained accuracies are given below.

Window Size     Frequency       Accuracy
---------------------------------------
 0                 1            0.04012
 0                 2            0.04012
 0                 5            0.04012
 2                 1            0.04012
 2                 2            0.04012
 2                 5            0.04012
10                 1            5.37654
10                 2            5.37654
10                 5            5.37654
25                 1            5.37654
25                 2            5.37654
25                 5            5.37654

Analysis:
----------

From what I observed, the accuracy values are not correct; the program
is not performing as I had expected. It would be very unusual to get the
above accuracy values if it were working as expected. I assumed that as
the window size increased I would have a greater number of "window
words", and that as the frequency cutoff increased I would have fewer
window words. So for a high window value and a low frequency value, I
thought that the disambiguation would work better, because I would have
more features to help me disambiguate the words. For small window sizes
I will not have many features that really help to disambiguate a word,
because the important features that help in the disambiguating process
would not necessarily occur within the small window. Thus a big window
size and small frequency values should work best.
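One quick sanity check on the numbers above: accuracy is the fraction of
TEST instances whose assigned sense matches the gold sense, so it can
never exceed 1; a value like 5.37654 usually means the numerator and
denominator were taken from different counts. A minimal sketch of the
intended computation, assuming hypothetical parallel arrays rather than
the actual nb.pl:

  use strict;

  sub accuracy {
      my ($assigned, $actual) = @_;    # parallel arrays of sense tags
      die "length mismatch" unless @$assigned == @$actual;
      # scalar grep counts the indices where the two tags agree
      my $correct = grep { $assigned->[$_] eq $actual->[$_] } 0 .. $#$assigned;
      return $correct / @$actual;      # always between 0 and 1
  }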
Conclusion:
-----------

There were many ways in which my experiment did not perform as I had
expected. The experiment helped me understand how the process of word
sense disambiguation works. Tagging sentences at the Open Mind site also
helped me encounter new sentences and meanings and see how words are
used; it helped me understand what had to be done before the experiment
was started.

++++++++++++++++++++++++++++++++++++++++++

Bayesian Classifier Experiments
Report by Paul Gordon (1913768) for CS 8761, Statistical Natural
Language Processing
November 1, 2002

Objective:

To draw conclusions about how window size and cutoff frequency affect
the accuracy of sense tagging.

Introduction:

This report summarizes the results of a Bayesian Classifier for
determining word sense. The experiment varies two parameters: window
size and frequency cutoff. The window size is the number of tokens on
either side of the target word (the word to be sense tagged) that will
be included as part of the context used to determine the sense tag. A
word must appear more than the frequency cutoff number of times for it
to be used as a context word.

Hypothesis:

A reasonable first approximation might be a Markovian argument: that as
the window size increases, accuracy will increase, but at some point it
will begin to stabilize or fall again as the classifier incorporates
tokens further and further from the target word. An argument for the
frequency parameter might be that as the frequency cutoff increases, it
reduces the amount of context, and so will reduce the accuracy.

Results:

window size | frequency cutoff | accuracy
     0              1            0.5510
     0              2            0.5510
     0              5            0.5510
     2              1            0.7309
     2              2            0.7213
     2              5            0.7357
    10              1            0.7663
    10              2            0.8161
    10              5            0.8032
    25              1            0.8000
    25              2            0.8474
    25              5            0.8305

Conclusions:

It appears that the hypothesis was wrong on both counts. As the window
size increases, the accuracy increases, without letup. Also, as the
frequency cutoff increases, the accuracy also increases. It appears that
the increase in relevance brought about by increasing the frequency
cutoff more than makes up for the reduction in context. The results also
seem to call into question the Markov model of local dependence.

++++++++++++++++++++++++++++++++++++++++++

Victoria Harp 1632107
CS8761
November 1, 2002
Programming Assignment #3

window size | frequency cutoff | accuracy
--------------------------------------------------
     0              -            0.538955823293173
     2              1            0.601606425702811
     2              2            0.576706827309237
     2              5            0.525301204819277
    10              1            0.799196787148594
    10              2            0.769477911646586
    10              5            0.722088353413655
    25              1            0.809638554216867
    25              2            0.789558232931727
    25              5            0.719678714859438

As the frequency cutoff goes higher, the accuracy decreases. This means
we are concentrating on stop words and ignoring contextual content. Thus
the optimal frequency cutoff is 1, where all words in the sentence
window are given weight.

The optimal window was 25 (with a frequency cutoff of 1). I was
surprised by this result because I figured our Markov Assumption would
hold true and only words very close to the target would influence the
meaning. However, after more analysis, I realized that a large window
has room to hold stop words and garbage while also picking up the
important clue words, whereas the smaller windows were hampered by
stop-word garbage.

Our experiment is slightly skewed because we consider each sentence
individually. If we were to expand our view to the paragraph level, our
data and analysis would change. I suppose that could be considered the
Anti-Markov Assumption... but that line of reasoning goes beyond the
scope of the current project.

The zeros in the probability column in the TAGGING outfile mean the
calculated probability value is very small and the significant digits
are too far to the right to be displayed (see the formatting sketch
below).
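One possible fix for those displayed zeros, sketched here as an
assumption about the symptom rather than a confirmed diagnosis of this
program, is to print in scientific notation, or to report the log
probability instead of the raw value:

  my $p = 3.2e-47;             # a tiny but nonzero probability
  printf "%.4f\n", $p;         # prints 0.0000 -- looks like zero
  printf "%.4e\n", $p;         # prints 3.2000e-47
  printf "%.4f\n", log($p);    # prints -107.0583 (the natural log)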
I checked and double-checked my formula, but I couldn't see any way to
crank up the accuracy. I welcome your suggestions. Good night!

++++++++++++++++++++++++++++++++++++++++++

Prashant Jain
CS8761
Assignment #3: Finding meaning in Li(f|n)e

Introduction
------------

The objective of this assignment was to "explore supervised approaches
to word sense disambiguation." We have been given the line data, which
contains six files. There are a number of instances per file, with one
instance per line. Our task is to implement a Naive Bayesian Classifier
to perform word sense disambiguation: given an instance of 'line' in a
file, we have to find the sense that best fits it.

Procedure
---------

We had to create four files that together implement the Naive Bayesian
Classifier. These were:

select.pl
---------
This file divides the given data into TEST and TRAIN data after sense
tagging it. It is provided with a percentage argument as well as the
target.config file, which contains the regular expressions used to
extract lines and instance ids.

feat.pl
-------
This file uses the TRAIN data together with the window size and
frequency count (provided by the user at the command line) to get the
feature words (words of interest), which are put in the FEAT file.

convert.pl
----------
This file converts both the TRAIN and TEST data into feature vector
representations using the features provided by feat.pl. This is
basically a binary representation of instances/senses and features.

nb.pl
-----
This is the file in which we implement the Naive Bayesian classifier. We
use the files that we get from converting the TRAIN data to assign sense
values to the TEST data, and check how accurately we were able to assign
the correct senses.

Observations of Experiments
---------------------------

The following table describes the results we got from running our
experiments over the various combinations that had been given to us.

--------------------------------------------
|window size | frequency cutoff | accuracy|
--------------------------------------------
|     0      |        1         |  0.3728 |
|     0      |        2         |  0.3728 |
|     0      |        5         |  0.3728 |
--------------------------------------------
|     2      |        1         |  0.7485 |
|     2      |        2         |  0.7485 |
|     2      |        5         |  0.7189 |
--------------------------------------------
|    10      |        1         |  0.7692 |
|    10      |        2         |  0.7663 |
|    10      |        5         |  0.7604 |
--------------------------------------------
|    25      |        1         |  0.7308 |
|    25      |        2         |  0.7337 |
|    25      |        5         |  0.7071 |
--------------------------------------------

We notice that generally as we increase the window size, the accuracy of
our Naive Bayesian classifier increases. Intuitively this should be
expected: the more feature words we incorporate in our trials, the
greater the chance that they will occur in the test data; and the more
samples we have, the higher the probability of assigning the correct
sense.

We also notice that as we increase the frequency cutoff, there is a
definite decrease in accuracy. Again, intuitively, this should be
expected: as we keep increasing the frequency count, if (as in our case)
the stop words
have not been eliminated, they will have a higher chance of being the
only features that survive (stop words like 'a', 'and', 'the', etc.
appear far more frequently than, say, 'instrument', which would be a
helpful hint for assigning line the sense 'phone') at the expense of
other more interesting but somewhat less frequent words.

We notice from our data that the maximum accuracy occurs when the window
size is 10 and the frequency count is 1. No other combination, above or
below that, reaches the same accuracy, so this is the one that seems
optimal to me. Just to check, I ran nb.pl for two more values: window
sizes of 15 and 5. The results were just as I had expected. As we
increased the window size from 10 to 15, the accuracy decreased. This
was expected, since we had seen a fall in accuracy from window size 10
to window size 25. When decreasing the window size to 5, we saw that the
accuracy increased. This was not entirely expected, however.

|-----------|---------|---------|
|window-size|frequency|accuracy |
|-----------|---------|---------|
|    15     |    5    | 0.7396  |
|-----------|---------|---------|
|     5     |    5    | 0.7870  |
|-----------|---------|---------|

Hence I would like to conclude by saying that the Naive Bayesian
classifier implemented by me gives fairly decent results and shows the
property of accuracy increasing up to a certain window size and
frequency, after which it gradually goes down.

References:
-----------
Manning, C.D. and Schutze, H. 2000. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, Massachusetts.

++++++++++++++++++++++++++++++++++++++++++

Rashmi Kankaria
Assignment no: 3
Due Date: 2nd Nov 2002

Objective: To explore supervised approaches to word sense
disambiguation.

Introduction: Supervised disambiguation uses a set of training data that
has already been generated for an ambiguous word by sense tagging the
words in the training data; this labeled data is then used to
disambiguate the word in each new instance where it occurs. This
experiment is an attempt to implement a Naive Bayesian Classifier.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        product (with probability 0.5320)
     0             2        product (with probability 0.5320)
     0             5        product (with probability 0.5320)
----------------------------------------------------------------------
     2             1        0.6474
     2             2        0.6450
     2             5        0.6434
----------------------------------------------------------------------
    10             1        0.7687
    10             2        0.7631
    10             5        0.7562
----------------------------------------------------------------------
    25             1        0.7897
    25             2        0.7731
    25             5        0.7655
----------------------------------------------------------------------

In the case where the window size is 0, the sense with the maximum
probability, here product, is assigned as the default sense. This is
empirically shown to be correct.

Q> What effect do you observe in overall accuracy as the window size and
frequency cutoffs change?
For a given 70-30 split of the training and test data, we can observe
that the accuracy increases as we increase the window_size. This is
reasonable: as we increase the window size, and so the context, more
features will be present to disambiguate the word sense; hence there is
a greater chance that a feature is frequent for a given sense, and hence
a higher probability of guessing the sense of the word correctly.

Looking at the table, we can draw certain conclusions about the pattern
of accuracy for a given window_size and frequency cutoff.

For a constant frequency cutoff, the accuracy increases considerably
with increasing window_size. E.g., the accuracy increases from 0.645 to
0.7731 as we increase the window_size from 2 to 25. It is tempting to
conclude that accuracy will always increase with window_size.

For a constant window_size, however, the accuracy does not change with
the frequency cutoff in the linear way we might have expected for all
window sizes, and I think that can be explained. With a lower frequency
cutoff, for a given window size, only a few feature words are taken into
consideration, and those are most of the time stop words. Stop words do
not give any significant information about the context of the word --
information peculiar to the word that would help disambiguate it. On the
other hand, if the window size is too large, some extraneous information
(noise) can get added as features, which on the contrary may not be so
helpful.

The most significant values here are window_size = 10 with freq cutoff =
1, and window_size = 25 with freq cutoff = 5. The accuracy of the first
is higher than the second, and this can help us decide the optimal
frequency cutoff and window size.

The major flaw of the Naive Bayesian Classifier is that it considers all
the features to be independent. This also affects the calculation of the
feature probabilities.

Q> Are there any combinations of window size and frequency that appear
to be optimal with respect to the others? Why?

As argued above, there are a few combinations of window_size and
frequency cutoff that appear optimal with respect to the others. As we
can observe, for a given window size, the accuracy decreases with
increasing cutoff. We also observe that there is no significant change
in accuracy as we change the window size from 10 to 25, so the optimal
window_size in this case can be 10, since simply increasing the window
size does not change the accuracy significantly, for two reasons:

1. The context might still be smaller than the window_size, in which
   case increasing the window size does not make any difference.
2. The proximate context matters most for disambiguating the sense, so
   having a larger window might not help much.

As far as the frequency cutoff is concerned, the most optimal number
will be within 2-5, since high-frequency features are most of the time
stop words, which do not help disambiguation, while a frequency as low
as 1 will output all the feature words, most of which are not relevant
or occur very infrequently with the word we are trying to disambiguate.
Any word within the given range will show a strong association, over a
large training set, with the word we need to disambiguate. This gives us
the optimal values of window size and frequency cutoff.

References:
1. Foundations of Statistical Natural Language Processing by Christopher
   D. Manning and Hinrich Schutze, pp. 235-239.
2. Programming Perl (3rd edition) by Larry Wall, Tom Christiansen and
   Jon Orwant.

++++++++++++++++++++++++++++++++++++++++++

NAME: SUMALATHA KUTHADI
CLASS: NATURAL LANGUAGE PROCESSING
DATE: 11/1/02
CS8761: ASSIGNMENT 3

-> OBJECTIVE: TO EXPLORE SUPERVISED APPROACHES TO WORD SENSE
   DISAMBIGUATION.

-> INTRODUCTION:
   -> WORD SENSE DISAMBIGUATION: ASSIGN A MEANING TO A WORD IN CONTEXT
      FROM SOME SET OF PREDEFINED MEANINGS (OFTEN TAKEN FROM A
      DICTIONARY).
   -> SENSE TAGGING: ASSIGNING MEANINGS TO WORDS.
   -> FROM A SENSE TAGGED TEXT WE CAN GET THE CONTEXT IN WHICH A
      PARTICULAR MEANING OF A WORD IS FOUND.
   -> CONTEXT FOR HUMAN: TEXT + BRAIN
   -> CONTEXT FOR MACHINE: TEXT + DICTIONARY/DATABASE

-> MAIN PARTS OF ASSIGNMENT:
   -> TO RANDOMLY SELECT A% OF THE INPUT TEXT AND PLACE IT IN THE TRAIN
      FILE. THE REMAINING TEXT IS PLACED IN THE TEST FILE.
   -> TO SELECT FEATURES FROM THE INPUT FILE (TRAIN FILE) WHICH SATISFY
      A FREQUENCY CUTOFF.
   -> TO CREATE A FEATURE VECTOR FOR EACH INSTANCE PRESENT IN BOTH THE
      TRAIN AND TEST FILES.
   -> TO LEARN A NAIVE BAYESIAN CLASSIFIER FROM THE OUTPUT OF THE THIRD
      PART OF THE ASSIGNMENT, AND TO USE THAT CLASSIFIER TO ASSIGN SENSE
      TAGS TO THE TEST FILE.

-> WHEN CREATING SENSE TAGGED TEXT YOU ARE BUILDING UP A COLLECTION OF
   CONTEXTS IN WHICH MEANINGS OF A WORD OCCUR. THESE CAN BE USED AS
   TRAINING EXAMPLES.

-> THE BASIC PRINCIPLE INVOLVED IN WORD SENSE DISAMBIGUATION IS TO
   SELECT THE VALUE OF THE SENSE THAT MAXIMIZES THE PROBABILITY OF THAT
   SENSE OCCURRING IN THE GIVEN CONTEXT (MOST LIKELY SENSE).

-> WHILE USING THE "NAIVE BAYESIAN CLASSIFIER" WE ASSUME THAT THE
   FEATURES ARE CONDITIONALLY INDEPENDENT; THEY DEPEND ONLY ON THE
   SENSE.

-> NAIVE BAYESIAN CLASSIFIER:
   S = ARGMAX OVER SENSES OF  P(SENSE) * PRODUCT(i = 1 TO N) P(C_i | SENSE)
   WHERE S IS THE CHOSEN SENSE AND C_1 .. C_N ARE THE CONTEXT FEATURES.

-> REPORT:
   -> WE RAN THE PROGRAMS WITH 12 COMBINATIONS OF WINDOW SIZE AND
      FREQUENCY CUTOFFS, USING A 70-30 TRAINING-TEST DATA RATIO.

   WINDOW SIZE   FREQUENCY CUTOFF   ACCURACY
        0               1           0.5311
        2               1           0.5344
       10               1           0.5331
       25               1           0.5040
        0               2           0.5311
        2               2           0.5000
       10               2           0.4991
       25               2           0.4941
        0               5           0.5311
        2               5           0.4481
       10               5           0.4341
       25               5           0.4121

-> OBSERVATIONS:
   -> WHEN THE WINDOW SIZE IS ZERO, NO MATTER WHAT THE VALUE OF THE
      FREQUENCY IS, THE ACCURACY IS ALMOST EQUAL.
   -> WHEN THE FREQUENCY IS KEPT CONSTANT AND THE WINDOW IS INCREASED,
      THE ACCURACY DECREASES.
   -> WHEN THE WINDOW SIZE IS KEPT CONSTANT AND THE FREQUENCY IS
      INCREASED, THE ACCURACY DECREASES.
   -> THERE IS SOME RELATION BETWEEN FREQUENCY, WINDOW SIZE AND OVERALL
      ACCURACY, BECAUSE THE MEANING OF A WORD CAN BE GUESSED FROM ITS
      SURROUNDING WORDS.

++++++++++++++++++++++++++++++++++++++++++

# ********************************************************************************
# experiments.txt  Report for Assignment #3  Word Sense Disambiguation
# Name: Yanhua Li
# Class: CS 8761
# Assignment #3: Nov. 1, 2002
# ********************************************************************************
This assignment applies a Naive Bayesian Classifier to perform word
sense disambiguation. I carried out the first 6 experiments with all 6
"line" files. Since the CPU limit was exceeded when I ran experiments
with window size 10, I changed to working with 2 files. The CPU limit
was exceeded when I used window size 25 for all three of those
experiments, so I could not get results for the last three experiments.

Resulting Table
******************************************************************
          window size | frequency cutoff | accuracy
6 files        0              1            0.530120481927711
               0              2            0.530120481927711
               0              5            0.530120481927711
               2              1            1
               2              2            1
               2              5            1
__________________________________________________________________
2 files       10              1            1
              10              2            1
              10              5            1
              25              1            -
              25              2            -
              25              5            -

I also worked with several other combinations of window size and
frequency for 2 files, such as:

window size | frequency cutoff | accuracy
     4              5            1
     3              4            1

The results for accuracy are the same. From the results, we can see my
Naive Bayesian Classifier works very well! Except for window size 0, the
window sizes and frequency cutoffs do not affect accuracy at all,
because my classifier always assigns the correct senses to instances. It
is amazing?! I was thinking that window size and frequency, and the
combination of the two, should have something to do with accuracy; I
thought that when the window size and frequency cutoff were both not too
big and not too small, the accuracy would be best. But if the method is
too good and always does the right thing, then those factors don't
matter anymore. For window size 0 we don't have any features, so we
actually don't use the classifier to assign senses but give every
instance the "most common sense", and the accuracy is low, as expected.
That case depends on the window size (because it is 0) but has nothing
to do with the frequency (because there is no frequency of any feature).

++++++++++++++++++++++++++++++++++++++++++

# Aniruddha Mahajan (Andy)
# CS8761 - Fall 2002
# Assignment 3 -- Word Sense Disambiguation

Ordinary, or rather everyday, language is ambiguous in the sense that it
lacks specific instructions which would enable the user to perfectly
understand the information in the form it was expressed. In everyday
life we as humans work around this ambiguity in various ways, be it
through the 'extra' sense obtained from speech, through gestures, or
simply through anticipating and deducing what a particular piece of
grammar might mean. This is word sense disambiguation. Computers have to
be taught to disambiguate the sense of a word as they cannot learn by
themselves... in some ways similar to a small child, yet in some ways so
different. The disambiguating ability is achieved through a set of
training data. Supervised word sense disambiguation has data that is
tagged, i.e., each datum's sense is attached to it in the training data.
So the program has the training data from which to 'learn' word sense
anticipation, i.e., to compute the sense to be expected whenever a
certain few words are around. In the given program a corpus of 'line'
data was used. It has a total of 6 senses according to which the data is
grouped.
The experiments performed on the data involved 4 programs -- select.pl,
feat.pl, convert.pl & nb.pl. In the last program we implement a Naive
Bayesian Classifier along with Witten-Bell smoothing of the probability
mass distribution. Due to the smoothing, unobserved events are not taken
with a probability of zero; instead the probability mass distribution is
shifted a bit to make way for new events.

The table shows the results for the various testing combinations as
conducted:

Window Size     Frequency cutoff        Accuracy
------------------------------------------------
 0                      1               0.5261
 0                      2               0.5261
 0                      5               0.5261
 2                      1               0.7760
 2                      2               0.7622
 2                      5               0.7301
10                      1               0.8418
10                      2               0.8263
10                      5               0.8161
25                      1               0.8490
25                      2               0.8386
25                      5               0.8345
-------------------------------------------------

From the table we can clearly observe that as the window size increases,
the accuracy of prediction also increases, though this might become
stagnant after some point, and possibly decrease after a further
increase in the window size. On the other hand, as the frequency cutoff
increases within a window size, the accuracy is found to decrease
gently. Also, the data with a window size of zero (0) will always have
the same accuracy, no matter what the frequency cutoff (for the same
piece of training data). Another notable point is that even at a window
size of 2 the accuracy seems reasonable enough.

During the initial testing of these programs I ran the experiments on
only 2 senses, and those of only half the size. Still, a reasonable
accuracy was achieved with W=4 and F=2.

The percentage parameter 'A' plays a major role in WSD performance. Here
an A of 70 was taken; thus the train data was more than twice the test
data and our little program was introduced to a fairish amount of
feature words. Of course, just having a larger data corpus would also
help, and would not build in the restrictiveness of a small corpus with
a large A value.

w8_087:11423: <> 0.0004 <>
w7_047:14219: <> 0.0002 <>
w7_107:826:   <> 0.0170 <>
w7_125:15262: <> 0.0000 <>
w9_7:12942:   <> 0.0002 <>
w7_124:728:   <> 0.0000 <>
w7_117:6660:  <> 0.0010 <>
w7_114:6618:  <> 0.0031 <>
w7_082:3841:  <> 0.0000 <>
w7_034:16236: <> 0.5386 <>
w9_19:6617:   <> 0.0002 <>
w7_084:9614:  <> 0.0000 <>
ACCURACY = 0.7759

Pasted above is a piece of the output obtained with W=2, F=1 on all 6
sense data files.

++++++++++++++++++++++++++++++++++++++++++

Typed and Submitted by: Ashutosh Nagle
Assignment 3: How to Find Meanings in Li(f|n)e
Course: CS8761
Course Name: NLP
Date: Nov 1, 2002

+-----------------------------------------------------------+
|                        Introduction                       |
+-----------------------------------------------------------+

This assignment has the following two parts:

1) Data Creation
~~~~~~~~~~~~~~~~
Here I tagged 564 words provided by the Open Mind Word Expert project.

2) Naive Bayesian Classifier Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have implemented this in four parts, as required by the assignment.
The point that I feel deserves mention is Witten-Bell smoothing.
Also, the probabilities here do not (and should not) add up to 1, because we are looking at a particular combination of sense and feature. A given instance can have multiple features for the single sense that it has, so if we sum the counts of the features that occurred with a particular sense, the sum will not equal the number of occurrences of that sense. Every sense can potentially occur with any number of features. So the values of the various features for a given sense are not the various values of the same random variable, but are in fact various random variables which overlap with each other.

One Sense Dominates:
One of the senses always dominates. When I performed the experiments with all six files in the line data, "production2" was the dominating sense. When I performed them with "formation2", "cord2" and "test2", "formation2" was the dominating sense. Accuracy goes fairly high (0.9211) when the experiment is performed with fewer files, and I think this is fairly explainable: when we have more senses to choose from, the probability of error is higher (5/6), but when we work with fewer senses the domain shrinks and the probability of error comes down (2/3). For the same reason, the number of possible senses has a much greater effect on the accuracy than the window size or frequency cutoffs have.

Also, I am seeing a strange phenomenon, which I initially took to be a bug in my code. I have spent two long days trying to figure it out, but I do not see any bug in the code. For all combinations of window size and frequency cutoff, I get almost the same accuracy. When I print about 8-10 digits after the decimal point, I see a little variation in the values, but not otherwise. So, to check whether the values can change at all, I performed the experiments with different numbers of possible senses (i.e., different numbers of files from the line data), and I found that the values do change, but for a particular number of files (possible senses) they remain the same.
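One quick check for this situation -- my suggestion, not part of the assignment -- is to count how many distinct senses the classifier actually assigns. If every test instance receives the same tag, the features never change the decision, and the accuracy collapses to the majority-class share, which would stay constant across window sizes and cutoffs. Assuming output lines of the form "instance_id sense":

    use strict;
    use warnings;

    # Tally the sense tags assigned in the classifier's output.
    my %assigned;
    while (<STDIN>) {
        my ( $id, $sense ) = split;
        $assigned{$sense}++;
    }
    printf "%d distinct sense(s) assigned\n", scalar keys %assigned;
    printf "  %-15s %d\n", $_, $assigned{$_} for sort keys %assigned;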
Tabulated below are my observations. W = window size; F = frequency cutoff; A(i) = accuracy with i files taken from the line data.

+---------------+---------------+---------------+---------------+--------+
|       W       |       F       |     A(3)      |     A(4)      |  A(6)  |
+---------------+---------------+---------------+---------------+--------+
|       0       |       1       |    0.9211     |    0.7378     | 0.7089 |
|       0       |       2       |    0.9211     |    0.7378     | 0.7089 |
|       0       |       5       |    0.9211     |    0.7378     | 0.7089 |
+---------------+---------------+---------------+---------------+--------+
|       2       |       1       |    0.9211     |    0.7378     | 0.7089 |
|       2       |       2       |    0.9211     |    0.7378     | 0.7089 |
|       2       |       5       |    0.9211     |    0.7378     | 0.7089 |
+---------------+---------------+---------------+---------------+--------+
|      10       |       1       |    0.9211     |    0.7378     | 0.7089 |
|      10       |       2       |    0.9211     |    0.7378     | 0.7089 |
|      10       |       5       |    0.9211     |    0.7378     | 0.7089 |
+---------------+---------------+---------------+---------------+--------+
|      25       |       1       |    0.9211     |    0.7378     | 0.7089 |
|      25       |       2       |    0.9211     |    0.7378     | 0.7089 |
|      25       |       5       |    0.9211     |    0.7378     | 0.7089 |
+---------------+---------------+---------------+---------------+--------+

I have kept all the output files in the directory /home/cs/nagl0033/NLP/assignment3/. There are a few subdirectories there too: a directory "ij" contains the files of experiments performed with window size 'i' and frequency cutoff 'j'.

Intuitive Reasoning:
I feel that as we increase the window size, we consider more features, so we should be able to predict the sense more accurately. But as the window size becomes too large, we start considering words that do not even affect the ambiguous word. And the presence of a word distant from the target word in one instance may adversely affect the calculations if that word, by chance, occurs in the vicinity of the target word in some other instance.

Secondly, as the frequency cutoff is increased, only the words occurring more frequently in the vicinity of the target word are considered, and those which might have occurred by fluke are ignored. This, therefore, should aid better prediction. On the other hand, if the frequency cutoff is too high, then it is equivalent to looking for some small number of features with a very large effect while ignoring a number of features, each of which has a fairly reasonable effect, but not a huge one. As words follow a Zipfian distribution, we are likely to find very, very few features if the cutoff is too high.

Expectation:
Based on this reasoning, I had expected the (10,5) case (i.e., window size 10 and cutoff 5) to give me the optimal result. But I am not able to see this in my table, probably because of some arithmetic error!
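Setting the puzzle aside, the Zipfian thinning described under Intuitive Reasoning is easy to check directly: count how many candidate features survive each cutoff. A minimal sketch (it reads whitespace-separated context words on standard input; the cutoff values are just examples):

    use strict;
    use warnings;

    # Count how many candidate features survive each frequency cutoff.
    my %freq;
    $freq{ lc $_ }++ for map { split } <STDIN>;
    for my $cutoff ( 1, 2, 5, 10 ) {
        my $n = grep { $freq{$_} >= $cutoff } keys %freq;
        print "cutoff $cutoff: $n features\n";
    }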
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

CS 8761 NATURAL LANGUAGE PROCESSING FALL 2002
ANUSHRI PARSEKAR
pars0110@d.umn.edu
ASSIGNMENT 3 : HOW TO FIND MEANINGS IN LIFE.

Introduction
------------
Many times we come across words which have several meanings (or senses), so there can be ambiguity about their meaning. Word sense disambiguation deals with assigning meanings or senses to ambiguous words by observing the context, that is, the neighboring words. In this assignment we attempt to disambiguate various meanings of the word line by using sense-tagged line data, in which six different senses of the word line have been identified and the data is divided into separate files for each sense.

Programs implemented
---------------------
select.pl : This program tags every instance of the target word (here line) according to the file it is in and then randomly divides the data into TRAIN and TEST parts.

feat.pl : This module identifies the feature words associated with the word to be disambiguated (here 'line'). The range of feature words can be controlled by changing the window size as well as the required frequency of occurrence of the feature word. The feature words are obtained from the TRAIN file only.

convert.pl : The input to this program is the TRAIN or TEST file and the list of feature words. Each instance in the file is converted into a series of binary values that indicate whether or not each feature word listed has occurred within the specified window around the target word in the given instance.

nb.pl : This program learns a Naive Bayesian classifier from the feature vectors of TRAIN and uses that classifier to assign sense tags to the occurrences of the target word in the TEST file. The tag assigned is the sense having the maximum P(sense|context). The context words are assumed to be conditionally independent.

Expected results
----------------
1. When the window size is zero we do not have any feature words or context from which to guess the sense of the target word. The only guideline we have is the probability of each sense, and we assign the same sense, the one with the maximum probability, to all the instances of the test data (a majority-class baseline; see the sketch after this list). The accuracy thus obtained will not depend on the frequency cutoff.
2. As we increase the window size, the number of feature words will increase. We expect an increase in accuracy with an increase in window size, because a larger window means that we are considering a wider context. Feature words which were ignored at a small window size will then also be used and might give better accuracy.
3. Increasing the frequency cutoff will again reduce the number of feature words, which may reduce the accuracy.
4. If we run the program on just 50% of the training data, the accuracy should drop, as the classifier will have less data to learn from.
5. If the target word has fewer senses, then better accuracy can be expected, because the sense assigned to the word is chosen from a smaller number of possibilities.
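The baseline in point 1 can be computed directly from the TRAIN data. A minimal sketch -- the sense-tag pattern here is an assumption for illustration, not the format the submitted programs actually match:

    use strict;
    use warnings;

    # Count sense tags in TRAIN and report the majority sense --
    # the accuracy a zero-feature classifier should converge to.
    my %freq;
    my $total = 0;
    while (<STDIN>) {
        next unless /<sense>([^<]+)<\/sense>/;   # assumed tag format
        $freq{$1}++;
        $total++;
    }
    die "no sense tags found\n" unless $total;
    my ($best) = sort { $freq{$b} <=> $freq{$a} } keys %freq;
    printf "majority sense %s, baseline accuracy %.4f\n",
           $best, $freq{$best} / $total;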
Results
--------
------------------------------------------------------------------------------------------
| window size | frequency cutoff | accuracy for    | accuracy for    | accuracy for     |
|             |                  | 70% TRAIN data  | 50% TRAIN data  | only two senses  |
|----------------------------------------------------------------------------------------|
|      0      |        1         |     0.5374      |     0.5590      |      0.4979      |
|      0      |        2         |     0.5374      |                 |                  |
|      0      |        5         |     0.5374      |                 |                  |
|-------------|------------------|-----------------|-----------------|------------------|
|      2      |        1         |     0.6737      |     0.7413      |      0.7657      |
|      2      |        2         |     0.6358      |                 |                  |
|      2      |        5         |     0.6132      |     0.6963      |      0.7448      |
|-------------|------------------|-----------------|-----------------|------------------|
|     10      |        1         |     0.7429      |     0.8148      |      0.8870      |
|     10      |        2         |     0.7172      |                 |                  |
|     10      |        5         |     0.6817      |     0.7689      |      0.8452      |
|-------------|------------------|-----------------|-----------------|------------------|
|     25      |        1         |     0.7663      |                 |      0.8912      |
|     25      |        2         |     0.7566      |                 |                  |
|     25      |        5         |     0.7228      |                 |      0.8703      |
------------------------------------------------------------------------------------------

Conclusions
-----------
1. The accuracy increases as we take a greater number of feature words. Thus, when we increase the window size or take a low frequency cutoff, we are considering a more detailed and wider context; hence the accuracy increases.
2. However, a very large window size does not help much, because the target word is most of the time unrelated to the distant words which appear as feature words when the window size is large.
3. The memory and time required for a large window and low frequency cutoff are quite high compared to the increase in accuracy that they give. This should be considered when choosing an optimal combination of window size and frequency cutoff.
4. A window size of 10 and a frequency cutoff of 1 or 2 seem to give good accuracy for the line data. However, we cannot make general statements on the basis of this, since the senses of line are distinct and have a separate set of feature words for each sense.
5. Even when the experiment was performed by taking only 50% of the line data as the TRAIN data instead of 70%, the accuracy seems to have increased. This is opposite to what was expected.
6. As expected, when the experiment is done using just two senses of the target word, the accuracy has mostly increased.
7. Overall, the performance of the classifier is quite good for the line data, with average accuracy above 0.7.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

REPORT
--------
-----------------------------------------------------------------------------
Prashant Rathi                                          Date - 1st Nov. 2002
CS8761 - Natural Language Processing
-----------------------------------------------------------------------------
Assignment no. 3
------------------
Objective :: To explore supervised approaches to word sense disambiguation. You will create sense-tagged text and see what can be learned from the same. Sense-tag some text and implement a Naive Bayesian Classifier to perform word sense disambiguation.
-----------------------------------------------------------------------------
IMPLEMENTATION:
--------------
The implementation is divided into four programs, each of which performs a specific function.
- select.pl creates the Train and Test data from the 'line' data by selecting instances randomly.
- feat.pl works on this Train data and outputs the feature words selected, given the window-size and frequency-cutoff conditions.
- convert.pl creates feature-vector tables for the Test and Train data.
- nb.pl implements the Naive Bayesian classifier which performs the word sense disambiguation and produces the accuracy results.

* target.config is used to specify the regular expression for the target word and for matching instance_ids. This makes the programs more general, so they can easily be used for other data with slight modifications.

-------------------------------------------------------------------------------
OBSERVATIONS:
------------
Experiments were carried out with window sizes of 0, 2, 10 and 25 and frequency cutoffs of 1, 2 and 5. For these 12 combinations of frequency and window size, with a 70-30 training-test data ratio, the following observations were made:

---------------------------------------------
 window size | frequency cut-off | accuracy |
---------------------------------------------
      0      |         1         |  0.5418  |
      0      |         2         |  0.5418  |
      0      |         5         |  0.5418  |
      2      |         1         |  0.7355  |
      2      |         2         |  0.7251  |
      2      |         5         |  0.7002  |
     10      |         1         |  0.7685  |
     10      |         2         |  0.7902  |
     10      |         5         |  0.7982  |
     25      |         1         |  0.8053  |
     25      |         2         |  0.8264  |
     25      |         5         |  0.8384  |
---------------------------------------------

- The accuracy results are as shown in the table. These are the results observed after applying Witten-Bell smoothing. I also observed the accuracy results without applying smoothing; those values were lower in comparison. Thus, by distributing some probability among the unobserved types (smoothing), we improve the accuracy results.
- The accuracy values observed when the window size is 0 are all the same for different frequency values. This is because it is simply the most frequently occurring sense in the training data which is selected each time, and with a window size of 0, frequency does not come into the picture. For a window size of 0 the feature file does not contain any features, and Test.fv and Train.fv just have the instance_ids and the senses. This is the lower bound of the performance, the upper bound being human performance. It was also observed that product2 was the most frequent sense in this case.
- It can also be observed from the table that as the window size increases we get better accuracy. I think this is due to the fact that the more we know of the context around the target word, the higher the probability of getting the sense right.
- For small windows, in my view, the frequency variations do not produce such radical changes in accuracy as they do for larger windows.
- The smoothing helps improve the accuracy when the feature vectors are large and contain a lot of 0 values (i.e., unobserved types).
- For window sizes of 10 and 25, it is seen that as the frequency cutoff increases, the accuracy increases. This may be because the more often a feature word occurs with the target word, the more it contributes to getting the correct sense.
- But in the case of window size 2, I observed that as the frequency cutoff increases, the accuracy decreases slightly. The only thing I could deduce from this is that with a lower cutoff we get a larger number of words, and hence perhaps a better context for getting the correct sense.
- The probability values assigned by the classifier are very low, as can be seen from the output of nb.pl (see the sketch after this list for one common way of handling such small values).
- It is also important to maintain the distinction between the Test and the Train data. If the Test data itself were used for training purposes, the results would be flawed, as they would not explore all possibilities. It might also be a good idea to create a few variant sets of test data for exhaustive testing.
- I think the way the instances are picked randomly plays a role in the accuracy values obtained, because it determines the number of instances picked with a particular sense.
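Regarding the very low probability values noted above: they arise because the score for a sense is a product of many conditional probabilities, each below 1. A standard remedy -- a sketch of the usual trick, not a claim about what the submitted nb.pl does -- is to compare sums of logarithms instead of raw products:

    use strict;
    use warnings;

    # Pick the best sense by summed log probabilities, avoiding underflow.
    # $prob->{$sense}{$feature} is assumed to be already smoothed (non-zero);
    # $prior->{$sense} is the sense's prior probability. Both structures are
    # assumptions made for this sketch.
    sub best_sense {
        my ( $prob, $prior, @features ) = @_;
        my ( $best, $best_score );
        for my $sense ( keys %$prior ) {
            my $score = log $prior->{$sense};
            $score += log $prob->{$sense}{$_} for @features;
            ( $best, $best_score ) = ( $sense, $score )
                if !defined $best_score || $score > $best_score;
        }
        return $best;
    }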
-------------------------------------------------------------------------------
CONCLUSION:
-----------
In my view, word sense disambiguation implemented using the Naive Bayesian Classifier does give a reasonable amount of accuracy (at least the results for the 'line' data show that). It is therefore one of the important methods among supervised learning approaches.
-------------------------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

CS 8761: Assignment 3 - Naive Bayesian Classifier
Due: 11/1/2002 12pm
Sam Storie
10/29/2002

+------------------------------------------------------------------------+
|                                 Methods                                 |
+------------------------------------------------------------------------+

Introduction:
A Bayesian classifier (BC) is a technique used to classify a set of events into "bins" and then use certain information to classify new events into one of those bins. In the context of word sense disambiguation we use the set of words surrounding a "target word" to determine the sense of that word. These surrounding words are considered "features" of the word and are treated as potential identifying items concerning the sense of the target word. This assignment deals with determining the sense of the word "line". To do this we examine a set of data where the sense of line has been predetermined and use this data to "train" our classifier. Then we use the information from the training data to try to determine the sense of some unseen testing data. The details of this process are described in subsequent sections.

Process:
This assignment is broken into four separate programs. The first program examines a set of pre-tagged instances of line. To ensure our classifier isn't developing a bias, this program separates the data into training and testing portions: it randomly selects a certain percentage of the instances to be placed into a file called TRAIN, and the remainder are placed into a file called TEST. The classifier is developed using only the training data, and isn't shown the testing data until the actual performance testing is done.

The second stage of the classifier is a program that determines which words occur within a certain window of the target word in the training instances. Recall from the introduction that we consider these contextual words to be some sort of indicator of the sense for that instance. Of course there are some overlaps for common words, but this is taken into account during this process. To help identify words that appear often within this window, we can set a frequency cutoff to eliminate uncommon words.
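A minimal sketch of this stage -- illustrative only; the target-word pattern is hard-coded here, whereas the real programs read it from target.config, and the window size and cutoff are example values:

    use strict;
    use warnings;

    # Collect words occurring within $w positions of the target word
    # in the training instances, then keep those seen at least $cutoff times.
    my ( $w, $cutoff ) = ( 2, 1 );
    my %freq;
    while ( my $line = <STDIN> ) {
        my @tok = split ' ', $line;
        for my $i ( grep { $tok[$_] =~ /^lines?$/i } 0 .. $#tok ) {
            my $lo = $i - $w < 0     ? 0     : $i - $w;
            my $hi = $i + $w > $#tok ? $#tok : $i + $w;
            $freq{ $tok[$_] }++ for grep { $_ != $i } $lo .. $hi;
        }
    }
    print "$_\n" for grep { $freq{$_} >= $cutoff } sort keys %freq;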
The result of this stage is a list of words that occur within a certain window size and that also occur a certain number of times. This stage is performed only on the *training* data, since using the testing data would introduce a bias towards the features present in the testing data. That would obviously nullify any results we ultimately obtained from that data.

The third program uses the list of words from the previous stage to create feature vectors for all the instances of training and testing data. A feature vector is simply a list of 1's and 0's indicating whether a given word occurred within the window used to generate the features. If a given instance had a 1 for the word "IBM", this would mean that "IBM" appeared within the window around "line" for this instance. These feature vectors are generated for both the training and the testing data.

The final stage is to actually create the classifier and attempt to assign senses to the testing data. To do this we examine the feature vectors for the training data and determine how often each feature occurred with each sense. From this data we can compute the probability associated with a sense based on which words occur within the window size specified in the earlier stages. Then, when we examine the feature vector for a test instance, we can determine which sense is most likely (based on the training instances used) given the features that are in the window. The probability of each sense is computed by multiplying the probabilities of each feature occurring, given the sense; then we simply select the sense with the largest resulting probability. The final step is to assign a sense with the classifier and check it against the actual sense. The final accuracy (number correct / total number of instances) is reported to the user.

Technically, using the probabilities of the features occurring with the target word is a very involved process, but we are implementing a *naive* BC. This means that we treat the probability of each feature occurring with the target word as conditionally independent. If we did not make this assumption we would need to consider the probability not just of the features separately, but of their occurring in the sequence in which they actually appear. Zipf's law suggests that once you start considering sequences of more than 3-4 words, the probabilities become very hard to estimate, simply because those sequences occur rarely, if at all, and it is hard to say how likely they should be. This is a non-trivial simplification, but impressively it still generates some useful results. As a final note, in the case of a feature not occurring within a training instance, which would give a probability of zero, we use Witten-Bell smoothing to give those cases a probability to use in the calculations.
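Putting the final stage together in miniature -- a sketch under stated assumptions, not the submitted nb.pl; the structures $prior->{$sense} and $p->{$sense}[$i]{$value} (the smoothed probability of feature i taking a given binary value for a sense) are my invention for illustration:

    use strict;
    use warnings;

    # Pick the most probable sense for one binary feature vector.
    sub classify {
        my ( $vector, $prior, $p ) = @_;
        my ( $argmax, $max );
        for my $sense ( keys %$prior ) {
            my $score = $prior->{$sense};
            $score *= $p->{$sense}[$_]{ $vector->[$_] } for 0 .. $#$vector;
            ( $argmax, $max ) = ( $sense, $score )
                if !defined $max || $score > $max;
        }
        return $argmax;
    }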
+------------------------------------------------------------------------+
|                                 Results                                 |
+------------------------------------------------------------------------+

I ran this set of four programs with a combination of window sizes and frequency cutoffs. Per the assignment, the initial set of instances was split into 70% training data, with the remaining 30% set aside for testing. The results are summarized as follows:

Window size | Frequency cutoff | Accuracy
+------------------------------------------+
      0              1            .5490
      0              2            .5361
      0              5            .5418
      0             10            .5080
      2              1            .4679
      2              2            .4735
      2              5            .4461
      2             10            .3915   <-- Worst result
     10              1            .5812
     10              2            .5820
     10              5            .6326
     10             10            .6391
     25              1            .5924
     25              2            .6761
     25              5            .6833
     25             10            .7219   <-- Best result
+------------------------------------------+

As you can see in the table, the best accuracy obtained was about 72%, and this was consistent across several trials. I found it rather interesting to see how the window size affected the accuracy across the trials. For a window size of 0 I would expect the accuracy to match the frequency of the most common sense. This is because, with no features to help in the classification, the only data that can be used is how often each sense occurred; the result is that every testing instance is tagged with the same sense. In the case of line, the "product" sense occurs about 53% of the time (2218 out of 4149 instances), and this is close to the results obtained for the window size of 0. There is a little fluctuation due to the randomization of the data, but it is easy to see that this is close to what would be expected based on the frequency of the "product" sense.

Perhaps more interesting is the case where the window size of 2 is used. Intuitively I imagined that including more data would help determine the sense, but it is clear that this is not only failing to increase the accuracy, it is actually causing it to go down. After seeing this result I thought about what might be causing it, and I believe it is due to the types of words that occur within such a small window. Semantically important words probably do not immediately precede or follow a target word, but are separated from it by common words like "the", "for", "at", etc. The classifier is basing its decisions on meaningless words that have skewed its choices towards an uncommon sense (frequency-wise), and when it is used to guess instances where another sense dominates, it produces poor results. I think this is also evident when you examine what happens as the frequency cutoff is raised: this reduces the feature set to only those "bad words" and further skews the decisions.

More of what I expected shows in the results for window sizes of 10 and 25. I imagined that when you allow the sense-indicating features to appear farther away from the target you should get better results, and I think this is exactly what is happening. The corollary here is that by doing so we open up the chance for more "bad words" to start appearing as features, but it is easy to see that as we tighten the frequency requirement (to eliminate more of these extra features) our results improve. While those same meaningless words will still appear as features, now we have other features on which to base the final decision. With a window size of 25 we are essentially examining the entire instance, so there is a high chance that if some identifying feature occurs, the classifier will include it and it will influence the final decisions.

On a final note, the meaning of a 72% success rate is difficult to gauge, because for a human this task is often trivial. However, given that the actual tests are based solely on the instance itself, I think it is rather impressive that such a simple technique can get a correct result 72% of the time. I imagine that some tweaking of the various parameters of the experiment, or even of the specific data used, could probably produce better results.
Of course, these results are based on the amount of training data available (about 3600 instances in this case) and the breadth of examples that data covers. Still, with these limitations in mind, the Naive Bayesian Classifier is a simple, and clearly useful, technique for performing word sense disambiguation.

+-------------------------------------------------------------------------+
|                    References and More Information                      |
+-------------------------------------------------------------------------+
(1) Original assignment on Dr. Pedersen's web page -
    http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a3.html

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

                                   \\\|///
                                 \\  ~ ~  //
                                  (  @ @  )
**********************************-oOOo-(_)-oOOo-************************************
 CS 8761 - Natural Language Processing - Dr. Ted Pedersen
 Assignment 3 : HOW TO FIND MEANINGS IN LIFE
 Due : Friday, November 1, Noon
 Author : Anand Takale ( )
***************************************-OoooO-***************************************
                                  .oooO  (   )
                                  (   )   ) /
                                   \ (   (_/
                                    \_)

Objectives : To explore supervised approaches to word sense disambiguation. To create
----------   sense-tagged text and see what can be learned from the same.

Specification : Sense-tag some text and implement a Naive Bayesian classifier to perform
-------------   word sense disambiguation.

----------------------
Part I : Data Creation
----------------------
Tag 500 instances/sentences on the Open Mind Word Expert website.
login id : Anand
Project : CS8761-UMD
Instances Tagged : 515
There was no output to be turned in from Open Mind sense tagging.

--------------------------------------------------
Part II : Naive Bayesian Classifier Implementation
--------------------------------------------------
This assignment mainly deals with word sense disambiguation. The problem to be solved is that many words have several meanings or senses, so for such words given out of context there is ambiguity about how they are to be interpreted. Thus our task is to do the disambiguation, i.e., to determine which of its senses an ambiguous word invokes in a particular use of the word. This assignment required us to implement a Naive Bayesian Classifier: in short, it learns a Naive Bayesian classifier from the TRAIN data and uses the classifier to assign sense tags to the TEST data. The idea of the Bayes classifier is that it looks at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does not do feature selection; instead, it collects evidence from all features.
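To make "evidence from all features" concrete, each instance is first reduced to a binary vector over the feature list, roughly as follows -- a simplified stand-in for convert.pl, with invented example words:

    use strict;
    use warnings;

    # Turn one instance's context words into a binary feature vector,
    # given the feature list produced in the previous step.
    sub to_vector {
        my ( $features, @context ) = @_;    # $features: arrayref of words
        my %seen = map { $_ => 1 } @context;
        return [ map { $seen{$_} ? 1 : 0 } @$features ];
    }

    my $fv = to_vector( [ 'telephone', 'fishing', 'product' ],
                        qw(he waited on the telephone line) );
    print "@$fv\n";    # prints: 1 0 0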
Implement a Naive Bayesian Classifier to perform word sense disambiguation, and perform the experiments with the "line" data. The assignment required the implementation of four modules:

(1) select.pl  -- program to construct the TRAIN and TEST data
(2) feat.pl    -- program to extract features
(3) convert.pl -- program to build the feature vector representation
(4) nb.pl      -- program to learn a Naive Bayesian Classifier from TRAIN.FV and use that classifier to assign sense tags to TEST.FV

Our main task was to create a Naive Bayesian Classifier from the TRAIN data and then disambiguate the words from the TEST data. We had to perform disambiguation of the "line" data. For this, the six files containing the different senses of line were combined, and lines were read randomly from any of the files and put into the TRAIN and TEST files. The TRAIN file received 70% of the data, while the TEST file received the remaining 30%. After creating the TRAIN and TEST files, the features which had a frequency greater than the cutoff frequency were selected. After observing the extracted features, it was noted that conjunctions, prepositions and determiners like a, an, and, the occurred in almost all cases. To be more specific about the task of disambiguation we need to find more dependent collocations and try to eliminate all the 'stop' words such as the conjunctions, determiners etc., so that the accuracy of the classifier improves (a sketch of such filtering appears later in this report).

------------
Experiments
------------
Experiments used window sizes of 0, 2, 10 and 25 and frequency cutoffs of 1, 2 and 5. The classifiers were run with all 12 possible combinations of window size and frequency cutoff using a 70-30 training-test data ratio. The accuracy values obtained for each combination are reported in the table below.

------------------------------------------------------------
 window size    |   frequency cutoff    |   accuracy
------------------------------------------------------------
      0         |          1            |    0.5399
      0         |          2            |    0.5399
      0         |          5            |    0.5399
------------------------------------------------------------
      2         |          1            |    0.6401
      2         |          2            |    0.5911
      2         |          5            |    0.5413
------------------------------------------------------------
     10         |          1            |    0.7044
     10         |          2            |    0.6996
     10         |          5            |    0.5662
------------------------------------------------------------
     25         |          1            |    0.6843
     25         |          2            |    0.6707
     25         |          5            |    0.5390
------------------------------------------------------------

------------------------------------------------------------------------------------
What effect do you observe in overall accuracy as the window size and the frequency
cutoffs change?
------------------------------------------------------------------------------------
After observing the data recorded in the table above, we come to the following conclusions:

(1) When the window size is zero, i.e., there are no features available to train the Naive Bayesian Classifier, the sense that occurs most often in the TRAIN data is assigned as the predicted sense for all the instances of the TEST data. In the case of the line data, the accuracy observed for a window size of 0 was 0.5399, which is the probability of the most frequently occurring sense. From this we can conclude that this much accuracy is available even without any prior knowledge, i.e., without actually training the Naive Bayesian Classifier on features; so this is the lowest accuracy that can be obtained from this classifier. So we see that for the line data the minimum accuracy obtained will be 0.5399.

(2) For a window size of 2 the accuracy increases considerably. We see that the accuracy goes up to 0.6401 for a cutoff frequency of 1. What we can conclude from this is that with a window size of 2 we can predict the sense of almost 64% of the test instances. So we see that disambiguation becomes easier when we have some context rather than no context at all.
One more observation that can be made at this point: according to Zipf's law, most bigrams occur rarely, while a few occur very frequently, and we observe the same kind of effect here. For the window of size 2 we observe that as the cutoff frequency increases, the accuracy keeps decreasing. This is consistent with a Zipfian distribution: most of the bigrams occur very few times, so as the frequency cutoff increases, fewer and fewer of the words survive as features. This is what we observe from the table tabulated above.

(3) Again, observing the accuracy values for different window sizes, we see that the accuracy keeps increasing up to a certain point before it falls off again; the phenomenon of decreasing accuracy with increasing window size is also observed. In both of these cases, i.e., with an increase in window size and with an increase in cutoff frequency, we see that there is eventually a considerable decrease in accuracy. This observed fact can be explained as follows: as we increase the window size from 0 to 25, the number of features present keeps increasing. As the number of features grows, the Naive Bayesian Classifier has a great deal to learn from, but not all the features are important for training the classifier. The stop words, i.e., the conjunctions and determiners, are actually of no use for training. What we are looking for is a more solid feature which is closely connected to the tagged word; by finding such solid features it becomes much easier to train the classifier. In short, the decrease in accuracy can be considered a consequence of the increase in noise words added to the feature vector, and these noise words are introduced as a result of an increase in window size or in cutoff frequency.

------------------------------------------------------------------------------------
Are there any combinations of window size and frequency that appear to be optimal
with respect to the others? Why?
------------------------------------------------------------------------------------
From the observations, i.e., from the table, the optimum accuracy is 0.7044, observed when the window size is 10 and the cutoff frequency is 1. This optimum combination of window size and cutoff frequency is specific to this data.

(1) The range of the optimum window size should stay much the same. The accuracy draws a curve with respect to the window size: initially, as the window size increases, the accuracy also increases, until it reaches an optimum value, from where it again starts to fall off. This is due to the fact that as the window size increases, the number of noise words also increases, which reduces the overall accuracy of the classifier, because the more closely connected tokens occur less often than the stop words, i.e., conjunctions, determiners etc.

(2) As far as the relation between cutoff frequency and accuracy goes, it is observed that as the cutoff frequency increases, the accuracy keeps decreasing. This is because, as the cutoff frequency increases, the more dependent words (or, we can say, the words more specific to that sense or context), which occur very few times, are eliminated from consideration as features. That leaves only the stop words, i.e., the conjunctions, prepositions, determiners etc., as features. These stop words cannot train the classifier the way the specific, dependent tokens do: they occur in almost all instances with roughly equal probability, making it more difficult to disambiguate the sense and causing the accuracy to go lower.
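As noted earlier in this report, one way to act on this observation -- my sketch, not part of the submitted programs -- is to drop a small stop list from the candidate features before the frequency cutoff is applied:

    use strict;
    use warnings;

    # Filter stop words out of a stream of candidate features,
    # one candidate per line, so that determiners and prepositions
    # never become features. The stop list is a small example.
    my %stop = map { $_ => 1 }
        qw(a an and the of to in on at for with by or but is was);
    while (<STDIN>) {
        chomp;
        print "$_\n" unless $stop{ lc $_ };
    }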
So, in short, we can say that a midrange window size and a lower cutoff frequency would most probably give the optimum accuracy. We also conclude that a window size of 0 gives the least accuracy.

-----------------
References :
-----------------
(1) Programming Perl - O'Reilly Publications
(2) Foundations of Statistical Natural Language Processing - Christopher D. Manning and Hinrich Schutze
(3) Problem Definition and some related concepts and issues - Pedersen, T. ( www.d.umn.edu/~tpederse )