++++++++++++++++++++++++++++++++++++++

******************************************************
*CS8761 - Assignment 2 - The Great Collocation Hunt  *
*Author: Amine Abou-Rjeili                           *
*Date: 10/09/2002                                    *
*Filename: experiments.txt                           *
******************************************************

Objective
---------
To investigate the effectiveness of different measures of association in identifying collocations and to compare their performance. All measures are implemented as NSP (v0.51) modules. True Mutual Information (tmi.pm) for bigrams and the Log Likelihood Ratio for trigrams were to be implemented and compared with an additional measure that can be used with both bigram and trigram collocations.

After extensive research using online resources and library books, I decided to use the t-score association measure. This is a statistical test that tells us how probable or improbable it is that a certain event will occur (Manning and Schutze). It is defined as:

    t = (sample_mean - distribution_mean) / sqrt(sample_variance / sample_size)

This test assumes that a sample is drawn from a normal distribution, and it determines how probable it is to draw a sample with this mean and variance. It compares the sample mean with the distribution mean and scales the difference by sqrt(sample_variance / sample_size). In these experiments the NULL hypothesis is:

    H(0): P(W1 W2 ... Wn) = P(W1) * P(W2) * ... * P(Wn)

In other words, we assume that the words are independent and that the occurrence of N-word collocations is merely by chance.
Based on the t-score for a particular collocation, we can reject the NULL hypothesis at a given confidence level, or fail to reject it if the score does not reach the value required for that level. The distribution mean is taken to be the product of the probabilities of the individual words of the collocation [P(W1)*P(W2)*...*P(Wn)], since we assume that they are independent (NULL hypothesis).

I chose the t-test because it is simple to implement. I wanted to compare its performance with association measures that work on the actual counts of events rather than on an estimated sample mean and variance, and to see how well it performs against these more complex measures. I expect that it will not perform as well as true mutual information because it considers mainly the joint probability of the individual words of the collocation. The true mutual information test, on the other hand, also considers the cases where the individual words occur separately, together with the remaining cases.

Another reason I chose this test is that it extends easily to trigrams. The NULL hypothesis becomes P(W1 W2 W3) = P(W1) * P(W2) * P(W3), so the distribution mean equals P(W1) * P(W2) * P(W3). This extension is straightforward to understand and to implement (see Experiment 2).
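As a rough illustration of the two bigram measures discussed above, the following Python sketch computes true mutual information over all four cells of the 2x2 contingency table and a t-score that uses only the joint cell. The function names and the (n11, n1p, np1, npp) argument order are assumptions made for this example. The t-score uses the Bernoulli variance approximation s^2 = x(1 - x) described by Manning and Schutze, which may differ from what the user2.pm module actually computes.

```python
import math


def tmi(n11, n1p, np1, npp):
    """True mutual information (in bits) summed over all four cells of the 2x2 table."""
    # Each tuple is (observed count, row marginal, column marginal) for one cell:
    # (w1 w2), (w1 not-w2), (not-w1 w2), (not-w1 not-w2).
    cells = [
        (n11, n1p, np1),
        (n1p - n11, n1p, npp - np1),
        (np1 - n11, npp - n1p, np1),
        (npp - n1p - np1 + n11, npp - n1p, npp - np1),
    ]
    total = 0.0
    for observed, row, col in cells:
        if observed > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            expected = row * col / npp
            total += (observed / npp) * math.log2(observed / expected)
    return total


def t_score(n11, n1p, np1, npp):
    """t = (sample_mean - distribution_mean) / sqrt(sample_variance / sample_size)."""
    if n11 <= 0:
        raise ValueError("the bigram must have been observed at least once")
    sample_mean = n11 / npp                       # observed P(w1 w2)
    dist_mean = (n1p / npp) * (np1 / npp)         # P(w1) * P(w2) under H(0)
    variance = sample_mean * (1.0 - sample_mean)  # Bernoulli approximation (assumed)
    return (sample_mean - dist_mean) / math.sqrt(variance / npp)


# Counts for ",<>and" taken from the tmi listing in this report (1206150 bigrams).
print(round(tmi(14223, 79083, 31336, 1206150), 4))
print(t_score(14223, 79083, 31336, 1206150))
```

Note that tmi() reads every cell of the table, while t_score() only needs the joint count and the two marginals, which is the difference in information the paragraph above refers to.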
I created two corpora of about 1,000,000 words each by joining a number of books from Project Gutenberg as follows:

CORPUS 1: # TOKENS 1083129
    Don Quixote by Miguel de Cervantes
    Crime and Punishment, by Dostoevsky
    Robinson Crusoe, by Daniel Defoe
    The Adventures of Sherlock Holmes by Arthur Conan Doyle
    Hamlet by Shakespeare
    PG has multiple editions of William Shakespeare's Complete Works

CORPUS 2: # TOKENS 1037882
==> bndlt10.txt <==
The Project Gutenberg Etext of A Bundle of Letters by Henry James
==> frnrd10.txt <==
The Project Gutenberg etext of The Friendly Road; New Adventures in Contentment by David Grayson (pseudonym of Ray Stannard Baker)
==> ktysc10.txt <==
The Project Gutenberg EBook of What Katy Did At School, by Susan Coolidge
==> lchch10.txt <==
The Project Gutenberg EBook of Chess and Checkers: The Way to Mastership by Edward Lasker
==> lstfc10.txt <==
The Project Gutenberg Etext of Lost Face, by Jack London
==> mrswf10.txt <==
The Project Gutenberg Etext of The Mayor's Wife, by Anna Katherine Green
==> outpo10.txt <==
The Project Gutenberg Etext of Outpost by J.G. Austin
==> rbcru10.txt <==
The Project Gutenberg Etext of Robinson Crusoe, by Daniel Defoe
==> rdst10.txt <==
The Project Gutenberg EBook of Rodney Stone, by Arthur Conan Doyle
==> stjlp10.txt <==
The Project Gutenberg EBook of The Story Of Julia Page, by Kathleen Norris
==> strlc10.txt <==
The Project Gutenberg Etext of The Story Of Electricity, by John Munro
==> trbny10.txt <==
The Project Gutenberg EBook of The Rover Boys in New York, by Arthur M. Winfield
==> txsrn10.txt <==
The Project Gutenberg EBook of A Texas Ranger, by William MacLeod Raine
==> wstys11.txt <==
The Project Gutenberg Etext of Under Western Eyes, Joseph Conrad

**********************************************
*****************Experiment 1*****************
**********************************************

TOP 50 COMPARISON:
------------------
Using CORPUS2:

Bigrams
-------
count.pl CORPUS_2N.cnt CORPUS2
statistics.pl tmi CORPUS_2N.tmi CORPUS_2N.cnt

Top 26 entries shown for each test

tmi
---
1206150
,<>and<>1 0.0232 14223 79083 31336
.<>The<>2 0.0109 3436 53260 4038
don<>t<>3 0.0087 1256 1257 4248
of<>the<>4 0.0084 6644 26941 52108
.<>He<>5 0.0082 2503 53260 2834
.<>It<>6 0.0075 2331 53260 2686
,<>,<>7 0.0066 2 79083 79083
in<>the<>8 0.0065 4523 15673 52108
.<>But<>9 0.0059 1801 53260 2055
.<>I<>10 0.0058 5149 53260 22711
,<>but<>11 0.0052 2690 79083 4665
.<>She<>12 0.0050 1560 53260 1806
,<>.<>13 0.0044 1 79083 53260
the<>,<>14 0.0043 2 52108 79083
had<>been<>15 0.0042 1013 7956 2446
Mrs<>.<>16 0.0041 1103 1105 53260
;<>and<>17 0.0039 2020 7387 31336
.<>And<>17 0.0039 1341 53260 1735
.<>You<>18 0.0038 1298 53260 1704
Mr<>.<>18 0.0038 1020 1029 53260
did<>not<>18 0.0038 790 1643 5777
.<>,<>19 0.0037 133 53260 79083
;<>but<>20 0.0034 1026 7387 4665
to<>be<>20 0.0034 1532 27462 5008
I<>am<>20 0.0034 823 22711 968
Project<>Gutenberg<>21 0.0033 322 408 322

user2 (t-score)
---------------
1206150
Todos<>los<>1 1098.2477 1 1 1
COINCIDENCES<>SUGGESTING<>1 1098.2477 1 1 1
Fops<>Alley<>1 1098.2477 1 1 1
laisser<>aller<>1 1098.2477 1 1 1
Scots<>Magazine<>1 1098.2477 1 1 1
superseded<>manual<>1 1098.2477 1 1 1
los<>Santos<>1 1098.2477 1 1 1
_ce<>miserable_<>1 1098.2477 1 1 1
ga<>irls<>1 1098.2477 1 1 1
Ari<>Frode<>1 1098.2477 1 1 1
CAVE<>RETREAT<>1 1098.2477 1 1 1
strutting<>cockerel<>1 1098.2477 1 1 1
Sallie<>Laundon<>1 1098.2477 1 1 1
DARTAWAY<>Quit<>1 1098.2477 1 1 1
quel<>nom<>1 1098.2477 1 1 1
personne<>indiquee_<>1 1098.2477 1 1 1
GIMLET<>BUTTE<>1 1098.2477 1 1 1
INTERFERENCE<>Struck<>1 1098.2477 1 1 1
Highland<>Fling<>1 1098.2477 1 1 1
Humphry<>Gunther<>1 1098.2477 1 1 1
WHICH<>CHRISTIAN<>1 1098.2477 1 1 1
Andrew<>Gamble<>1 1098.2477 1 1 1
TAMES<>GOATS<>1 1098.2477 1 1 1
Ferned<>grot<>1 1098.2477 1 1 1
MOST<>SUDDEN<>1 1098.2477 1 1 1
azure<>firmament<>1 1098.2477 1 1 1

Considering the top 50 entries in each test, the t-score clearly gave better results than the tmi. Most of the top entries of the tmi test include punctuation marks and stop words, which are not of much interest. One interesting collocation identified by the tmi measure is "Project<>Gutenberg<>". The t-score test, on the other hand, puts more meaningful words at the top and identifies some good collocations such as "Intellectual<>Peach<>", which most probably has some meaning as a unit and does not actually mean an 'intellectual peach'. Another good collocation identified by the t-score test is "Padre<>Johannes<>", which is a proper noun. Another interesting collocation is "Rural<>Taste<>". For most of the entries in the t-score table we can safely reject the NULL hypothesis at the 99.5% confidence level. There is no comparable measure of confidence for the true mutual information measure.

From the current experiments, one might be tempted to conclude that the t-score is significantly better at identifying interesting collocations than true mutual information. However, I would like to point out that the t-score measure might be over-classifying entries. In other words, it treats most bigrams as collocations when they are not. There might be many false positives in the data, and the sample output above shows that some of the top bigrams fall into this category. The same can be true of the true mutual information measure.
Here I would like to suggest the use of some sort of ratio that measures the accuracy of each measure. In other words, it would tell us how many of the bigrams are false positives and how many are actually true collocations. However, if we disregard the idea of false positives, I can conclude from the above data that the t-score performed better than the true mutual information test in this experiment. One reason for this difference can be attributed to the expected frequency of the individual tokens of a bigram. For example, punctuation marks appear many times in the text, and this affects the calculation of the tmi test since it is based on the expected and observed frequencies for each cell. The t-score test, however, does not consider all the cells in the contingency table, as discussed above.

CUTOFF POINT:
-------------
User2.pm:
For this measure, a cutoff seems to exist at approximately rank 150 (+-20). Beyond it, more and more bigrams do not seem to be good collocations. After about rank 120, at least a third of the bigrams are not collocations. The numerator of this test is the difference between the sample mean and the distribution mean. As we go further down the list the score decreases, which means that this difference is shrinking faster than the denominator. This indicates that the sample mean and the distribution mean are converging to a common value. Since the distribution mean is based on the NULL hypothesis that the tokens of the sequence are independent, and the sample mean converges towards the distribution mean, the sequences in our sample look more and more independent as we move down the list.

tmi.pm:
For this measure, there is a cutoff around rank 50. Most of the bigrams in the top positions are not collocations and consist of punctuation marks and stop words. However, after about rank 50 we start to see fewer punctuation marks and more word bigrams. We also start to observe some interesting collocations such as "Web<>sites<>", which occurs at rank 50. The user2.pm measure, on the other hand, places this collocation at rank 120, which is a big difference between the two rankings. This difference is due to the different counts that each measure considers. One frequency count that true mutual information considers but the t-score does not is the count of cases where the two words do not occur together, and I suggest that this contributes to the difference.

RANK COMPARISON
---------------

Comparing ll with tmi and user2:
--------------------------------
*****************************************************
csdev22(52):>./rank-script.sh ll tmi CORPUS_2N.cnt
Rank correlation coefficient = -0.8878
csdev22(53):>./rank-script.sh ll user2 CORPUS_2N.cnt
Rank correlation coefficient = 0.8745
*****************************************************

Comparing mi with tmi and user2:
--------------------------------
*****************************************************
csdev18(15):>./rank-script.sh mi tmi CORPUS_2N.cnt
Rank correlation coefficient = -0.8889
csdev18(16):>./rank-script.sh mi user2 CORPUS_2N.cnt
Rank correlation coefficient = 0.9516
*****************************************************

From the above results it can be observed that user2.pm (t-test) is very closely related to both mutual information and the log likelihood ratio. In particular, it is a near perfect match to mutual information (0.9516). The true mutual information test, on the other hand, produces a nearly reversed ranking compared to both mutual information and the log likelihood test, to approximately the same degree (-0.8889 and -0.8878). One reason for this is that the true mutual information test accounts for all the cells in the contingency table, whereas mutual information is only concerned with one of the cells, so the influence of the frequency counts differs. This is also the case with user2.pm, which only considers certain cells and not all of them.

OVERALL RECOMMENDATION
----------------------
Based on the experiments carried out, I would recommend the user2, mi and ll measures of association since they are very similar. I came to this conclusion from the results of user2.pm. The rank.pl program showed that it is very similar to ll and mi, so they should produce approximately similar rankings. The output of tmi was not very good: its top entries were not very meaningful and seemed to be ordinary bigrams with no association. I also noticed that the user2.pm (t-test) converges to the z-test when the degrees of freedom are large, as is the case in these experiments. This can also be seen from the t-distribution confidence interval values.

**********************************************
*****************Experiment 2*****************
**********************************************

This experiment involves collocations composed of 3-word sequences (trigrams). The first test to implement is the log-likelihood test modified to consider trigrams. The second test is the t-test measure (user3.pm), which is extended to accommodate three variables instead of two. As a result it is able to handle 2x2x2 contingency tables.
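To make the 2x2x2 extension concrete, here is a minimal Python sketch of a log-likelihood ratio for trigrams under the full independence model P(W1 W2 W3) = P(W1) * P(W2) * P(W3). It assumes the seven counts arrive in the order n111, n1pp, np1p, npp1, n11p, n1p1, np11 (which appears to match the count columns in the trigram listings of this report), together with the total number of trigrams nppp. The function name and argument layout are illustrative and are not taken from ll3.pm or user3.pm.

```python
import math
from itertools import product


def ll3(n111, n1pp, np1p, npp1, n11p, n1p1, np11, nppp):
    """G^2 = 2 * sum(O * ln(O / E)) over the eight cells of a 2x2x2 table."""
    # Observed cells; index 0 means "the word is present", 1 means "absent".
    obs = {}
    obs[0, 0, 0] = n111
    obs[0, 0, 1] = n11p - n111
    obs[0, 1, 0] = n1p1 - n111
    obs[1, 0, 0] = np11 - n111
    obs[0, 1, 1] = n1pp - n11p - n1p1 + n111
    obs[1, 0, 1] = np1p - n11p - np11 + n111
    obs[1, 1, 0] = npp1 - n1p1 - np11 + n111
    obs[1, 1, 1] = nppp - sum(obs.values())  # whatever is left of the total

    # One-dimensional marginals per position: (present, absent).
    margins = [(n1pp, nppp - n1pp), (np1p, nppp - np1p), (npp1, nppp - npp1)]

    g2 = 0.0
    for i, j, k in product((0, 1), repeat=3):
        o = obs[i, j, k]
        if o > 0:  # empty cells contribute nothing
            # Expected count if the three positions were independent.
            e = margins[0][i] * margins[1][j] * margins[2][k] / nppp ** 2
            g2 += o * math.log(o / e)
    return 2.0 * g2


# A trigram seen once whose three words occur nowhere else, in a tiny corpus.
print(ll3(1, 1, 1, 1, 1, 1, 1, 1000))
```

The same expected count e for the (present, present, present) cell is what a trigram t-score would use as its distribution mean after dividing by nppp.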
The following experiments were carried out:

TOP 50 RANK
-----------
Example run of the program:
---------------------------
count.pl --ngram 3 CORPUS_3N.cnt CORPUS_3N
statistics.pl --ngram 3 user3 CORPUS_3N.user3 CORPUS_3N.cnt

Top 26 entries shown for each test

user3
-----
1206149
_ma<>chere<>amie_<>1 1206149.0000 1 1 1 1 1 1 1
YZ<>MN<>OP<>1 1206149.0000 1 1 1 1 1 1 1
DEN<>WILD<>ZEE<>1 1206149.0000 1 1 1 1 1 1 1
saute<>aux<>champignons<>1 1206149.0000 1 1 1 1 1 1 1
ai<>trop<>dit<>1 1206149.0000 1 1 1 1 1 1 1
Cuesta<>del<>Burro<>1 1206149.0000 1 1 1 1 1 1 1
se<>laisser<>aller<>1 1206149.0000 1 1 1 1 1 1 1
vais<>seux<>cella<>1 1206149.0000 1 1 1 1 1 1 1
EXTRA<>PRIVATE<>LESSONS<>1 1206149.0000 1 1 1 1 1 1 1
WX<>GH<>IJ<>1 1206149.0000 1 1 1 1 1 1 1
WHICH<>CHRISTIAN<>MEETS<>1 1206149.0000 1 1 1 1 1 1 1
SPIRITUAL<>INTERFERENCE<>Struck<>1 1206149.0000 1 1 1 1 1 1 1
cauld<>kail<>hae<>1 1206149.0000 1 1 1 1 1 1 1
MN<>OP<>QR<>1 1206149.0000 1 1 1 1 1 1 1
COINCIDENCES<>SUGGESTING<>SPIRITUAL<>1 1206149.0000 1 1 1 1 1 1 1
Todos<>los<>Santos<>1 1206149.0000 1 1 1 1 1 1 1
AB<>CD<>EF<>1 1206149.0000 1 1 1 1 1 1 1
SUGGESTING<>SPIRITUAL<>INTERFERENCE<>1 1206149.0000 1 1 1 1 1 1 1
DAN<>BAXTER<>GIVES<>1 1206149.0000 1 1 1 1 1 1 1
filet<>saute<>aux<>1 1206149.0000 1 1 1 1 1 1 1
YE<>GENTLE<>READER<>1 1206149.0000 1 1 1 1 1 1 1
STEVE<>OFFERS<>CONGRATULATIONS<>1 1206149.0000 1 1 1 1 1 1 1
CD<>EF<>ST<>1 1206149.0000 1 1 1 1 1 1 1
_la<>personne<>indiquee_<>1 1206149.0000 1 1 1 1 1 1 1
BAXTER<>GIVES<>AID<>1 1206149.0000 1 1 1 1 1 1 1
Intellectual<>Peach<>Parer<>1 1206149.0000 1 1 1 1 1 1 1

ll3
----
1206149
.<>,<>the<>1 7786623.9583 2 53260 79083 52108 133 1502 2230
.<>.<>,<>2 7751902.6629 89 53260 53260 79083 1603 4088 133
,<>of<>,<>3 7715136.9923 2 79083 26941 79083 505 5551 49
,<>I<>,<>4 7394871.8862 3 79083 22711 79083 3598 5551 196
,<>and<>,<>5 7341499.8465 333 79083 31336 79083 14223 5551 651
,<>in<>,<>6 7192382.4111 1 79083 15673 79083 1132 5551 235
,<>was<>,<>7 7079014.9195 6 79083 12419 79083 500 5551 249
,<>that<>,<>8 7026236.7496 45 79083 12135 79083 1212 5551 472
,<>,<>one<>9 6962526.4875 1 79083 79083 3502 2 248 198
,<>you<>,<>10 6904810.7321 5 79083 9087 79083 858 5551 605
,<>had<>,<>11 6904386.0245 4 79083 7956 79083 277 5551 83
two<>,<>,<>12 6891919.9773 1 1611 79083 79083 56 147 2
,<>he<>,<>13 6880522.0448 1 79083 8746 79083 1592 5551 153
,<>for<>,<>14 6873635.0312 16 79083 7959 79083 990 5551 96
,<>s<>,<>15 6859784.9031 1 79083 6644 79083 1 5551 105
,<>as<>,<>16 6843698.9151 9 79083 8024 79083 1899 5551 30
square<>,<>,<>17 6837502.3425 1 174 79083 79083 23 11 2
,<>,<>V<>18 6834679.2425 1 79083 79083 79 2 4 1
,<>is<>,<>19 6830741.7429 1 79083 6389 79083 270 5551 222
,<>my<>,<>20 6823826.3993 3 79083 6056 79083 330 5551 9
and<>,<>the<>21 6821418.3294 5 31336 79083 52108 651 1409 2230
,<>be<>,<>22 6788346.9271 1 79083 5008 79083 43 5551 74
,<>of<>.<>23 6773018.1244 1 79083 26941 53260 505 1716 47
,<>she<>,<>24 6769209.9658 1 79083 5660 79083 1158 5551 86
,<>me<>,<>25 6762351.7502 3 79083 5225 79083 25 5551 935
,<>have<>,<>26 6762303.2148 2 79083 4391 79083 59 5551 45

Here we have a situation similar to the bigram case. The ll3 measure produces top results that are meaningless as collocations and are merely trigrams. The user3 test, on the other hand, produces top results that consist of alphabetic words, but I believe the majority of them are not meaningful collocations and just happened to be next to each other. However, there were some good examples of collocations such as "Electromotive<>Force<>Resistance<>", which describes a certain force. Here, I would say that the user3 test performed better with regard to the top 50 rankings.

CUTOFF POINT
------------
The cutoff points for these two measures differ tremendously. Even the rankings are vastly different. Consider the following trigram and its associated rank. In the case of ll3 it is ranked at 713651, but in the case of user3 it is ranked at 1. This is a tremendous difference.

Intellectual<>Peach<>Parer<>713651 - ll3
Intellectual<>Peach<>Parer<>1 - user3

For ll3 we only start to see trigrams that consist of words made up of alphanumeric characters halfway through the rankings. By contrast, in the user3 ranking table we see this phenomenon from the top of the table. I believe that this is because the likelihood measure considers both the expected and observed frequencies of every cell, whereas the t-test considers the sample mean and variance together with the distribution mean, which is the probability of the trigram under the NULL hypothesis (H0). For trigrams the marginal tables are very big because of the large number of combinations, and this influences the result of ll3.

I could not find a clear cutoff point for the ll3 rankings. Good collocations seemed to be mixed in everywhere in the table, and there was no clear-cut breaking point.

OVERALL RECOMMENDATION
----------------------
In this experiment I also recommend the user3 measure (t-test) because it identifies collocations better than the ll3 measure.

REFERENCES:
----------
- Foundations of Statistical Natural Language Processing - Manning and Schutze
- The Analysis of Cross-Classified Categorical Data - Fienberg
- Measures of Association - Albert M. Liebetrau
- Statistical Techniques for Data Analysis - John Keenan Taylor
- Multiway Contingency Table Analysis for the Social Sciences - Thomas D.
Wickens

++++++++++++++++++++++++++++++++++++++

Name  : Nitin Agarwal
Date  : 10-Oct-02
Class : Natural Language Processing CS8761
=========================================================

Objective

To find a measure of association that holds good for bi-grams as well as tri-grams, run it on a corpus, and finally compare this measure with some standard measures.

Procedure

First, the N-gram Statistics Package (NSP) version 0.51 needs to be downloaded and installed. Once this has been done, a measure of association needs to be found that can be used with both 2 and 3 word sequences. This measure then needs to be run on a large corpus to identify the 2 and 3 word collocations in it. The implementation consists of four .pm files, namely tmi.pm, user2.pm, ll3.pm and user3.pm.

Corpus

The corpus used for this assignment is corpus.txt.

Measure of association

The t-test has been used. The formula for the t-test is given below:

    t-score = (O11 - E11) / sqrt(O11)

where O11 is the observed value and E11 is the expected value. The measure of association was found on http://www.collocations.de/EK/am-html/

Conventions used

In this experiment, all the output files have the extension .op2 or .op3, depending on whether they are for bi-grams or tri-grams respectively. The output file name is the same as the name of the file being processed.

Experiment 1
------------

TOP 50 COMPARISON

The fiftieth bi-gram in tmi.op2 is ranked at 29, whereas in user2.op2 this row is occupied by a rank 49 bi-gram. Hence, in the latter file all but one of the bi-grams up to rank 50 have unique scores, which is not the case with tmi.op2. From the output files it can be seen that, compared to the top 50 collocations in tmi.op2, user2.op2 has more collocations with a higher frequency of occurring together.

CUTOFF POINT

In user2.op2, the score becomes negative from line 45312 onwards. All the bi-grams above this line have positive values. I could not find any cutoff point for ll.op2. The possible reason behind this is that the scores calculated using this measure are pretty high and hence do not reach a negative value.

RANK COMPARISON

tmi.op2 Vs mi.op2
-----------------
Rank correlation coefficient = 0.0874

As the value of the rank correlation coefficient is just above 0, the two files are almost unrelated to each other. One important point worth noticing is that in mi.op2 all of the first 121 bi-grams have the same rank, 1. In tmi.op2, by contrast, most of the bi-grams at the top of the list have unique scores. As we move down the list, some of the bi-grams share the same rank, and the number of such bi-grams seems to increase at an almost steady rate. Looking at the 3 right-most columns, it can be noticed that they all have the value 1 for the first ranked bi-grams in mi.op2, whereas in tmi.op2 the number of occurrences (third column of numbers) does not seem to follow any progression. This shows that mi.op2 lists all the collocations that occur once at the beginning. One interesting point that I noticed, which may not be very relevant, is that in tmi.op2 the score of the highest ranked bi-gram is 0.0387, which is close to the inverse of the score of the highest ranked bi-gram of mi.op2, 17.0977.
--output of tmi.pm
,<>and<>1 0.0387 2634 11467 4775
I<>had<>2 0.0174 810 5158 1546
;<>and<>3 0.0149 844 2330 4775
of<>the<>4 0.0086 817 3558 5854

--output of mi.pm
SEND<>MONEY<>1 17.0977 1 1 1
isle<>Fernando<>1 17.0977 1 1 1
COUP<>DE<>1 17.0977 1 1 1
repenting<>prodigal<>1 17.0977 1 1 1

tmi.op2 Vs ll.op2
-----------------
Rank correlation coefficient = 0.1916

In this comparison too the rank correlation coefficient is positive but very low. Hence, we infer that again the two output files have only a small degree of relatedness, although this time it is higher than in the previous comparison. Again, for ll.op2 all the collocations at the top of the list occur only once. In tmi.op2 the highest score is 0.0387 and the lowest value is 0, whereas in ll.op2 the highest value is 7516, which again goes all the way down to 0. Hence, we observe that the gradient for ll.op2 is much steeper than the gradient for tmi.op2.

--output of ll.pm
,<>and<>1 7516.6877 2634 11467 4775
I<>had<>2 3387.5897 810 5158 1546
;<>and<>3 2893.9474 844 2330 4775
;<>but<>4 1675.1604 349 2330 958

user2.op2 Vs mi.op2
-------------------
Rank correlation coefficient = 0.5824

The coefficient of 0.5824 shows that the two files are neither completely unrelated nor a perfect match. They are somewhere between the two. As is obvious from the coefficient, user2.op2 is closer to mi.op2 than tmi.op2 is. One similarity between the two files is that towards the end the scores attain negative values in both. Again in this comparison, all the scores at the beginning of user2.op2 are unique, which is not the case with mi.op2.

--output of user2.pm
,<>and<>1 43.7160 2634 11467 4775
I<>had<>2 26.4628 810 5158 1546
;<>and<>3 26.3213 844 2330 4775
of<>the<>4 23.3878 817 3558 5854

user2.op2 Vs ll.op2
-------------------
Rank correlation coefficient = 0.8393

As seen from the coefficient value for these 2 files, they are nearly a perfect match. Most of the rows in the two files have nearly the same collocations. The only major difference between the two files is the value of the scores for the individual bi-grams, which are far apart at the beginning, get closer towards the middle and start to diverge again at the end. Once more, user2.op2 is closer to ll.op2 than tmi.op2 is, as is clear from its rank correlation coefficient.

OVERALL RECOMMENDATION

I would suggest ll.pm as a good measure for identifying significant collocations. As I have discussed above, user2.op2 and ll.op2 have unique ranks for most of the bi-grams. The total number of unique ranks for user2.op2 is 12767, whereas for ll.op2 it is 19482. Since ll.op2 has more unique ranks than user2.op2, it is a better measure of association.

Experiment 2
------------

TOP 50 RANK

In ll3.op3, most of the top 50 elements have rank 1, whereas in user3.op3 most of the top 50 entries have different ranks. ll3.op3 lists all the tri-grams which occur once at the bottom, whereas user3.op3 lists a tri-gram that occurs 6 times first, and then this number starts decreasing before it starts increasing again towards the end of the file.

--output of user3.pm
by<>Daniel<>Defoe<>1 2.3877 6 590 6 6 6 6 6
PROJECT<>GUTENBERG<>tm<>2 2.2353 5 7 7 5 7 5 5
Small<>Print<>!<>3 2.2287 5 5 5 92 5 5 5
this<>Small<>Print<>4 2.1764 5 749 5 5 5 5 5
Illinois<>Benedictine<>College<>5 1.9998 4 4 4 4 4 4 4

--output of ll3.pm
data<>,<>transcription<>1 392089.1000 1 1 11467 1 1 1 1
,<>shame<>opposed<>1 392089.1000 1 11467 1 1 1 1 1
los<>Santos<>,<>1 392089.1000 1 1 1 11467 1 1 1
computerized<>population<>,<>1 392089.1000 1 1 1 11467 1 1 1
temperance<>,<>moderation<>1 392089.1000 1 1 11467 1 1 1 1

CUTOFF POINT

In user3.op3, the score becomes negative from line 8634 onwards. All the tri-grams above this line have positive values. In the case of ll3.op3, there doesn't seem to be any cutoff value.
OVERALL RECOMMENDATION

For tri-grams, I would suggest user3.op3 for the same reasons that were given for bi-grams.

++++++++++++++++++++++++++++++++++++++

Kailash Aurangabadkar
CS8761 Assignment #2
--------------------------------------------------------------------------------
The objective of the assignment is to investigate various measures of association that can be used to identify collocations in large corpora (CORPUS) of text. The assignment is essentially to find interesting co-occurrences of two or more words, i.e. co-occurrences that are not coincidental. Such words group to form a collocation.
--------------------------------------------------------------------------------
Collocations:
A collocation is a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.

Examples of Collocations:
Collocations include noun phrases like "strong tea" and "weapons of mass destruction", phrasal verbs like "to make up", and other stock phrases like "the rich and powerful", "a stiff breeze" but not "a stiff wind", "broad daylight" etc.

Criteria for Collocations
Typical criteria for collocations: non-compositionality, non-substitutability, non-modifiability. Collocations cannot be translated into other languages word by word. A phrase can be a collocation even if it is not consecutive (as in the example knock . . . door).

Compositionality
A phrase is compositional if its meaning can be predicted from the meaning of the parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination, e.g. strong tea. Idioms are the most extreme examples of non-compositionality, e.g. to hear it through the grapevine.

Non-Substitutability
We cannot substitute near-synonyms for the components of a collocation.

Non-modifiability
Many collocations cannot be freely modified with additional lexical material or through grammatical transformations.
----------------------------------------------------------------------------------
Process:-
The assignment is an addition to the NSP package currently available and developed by Satanjeev Banerjee. The implementation consists of four .pm files that are used by the statistic.pl program that comes with NSP. These .pm files will only run when used with NSP.
----------------------------------------------------------------------------------
The assignment consists of two experiments:-

Experiment 1:
To implement (true) mutual information and a test of association that is not implemented in NSP, for bigrams. The association tests implemented in NSP are: Left Fisher, Chi-Squared, Dice, Pointwise Mutual Information, and Log Likelihood. I have chosen to implement the u-test for collocations. The u-test is implemented in user2.pm. I have also implemented the Student t-test for association. The t-test is implemented in tscore.pm. After implementing the three tests I executed them on bigrams collected from CORPUS, which was obtained by concatenating all the text files in BROWN1. The null hypothesis for the u-test and the t-test is that the two words are independent.

Top 50 comparisons:
I have executed user2.pm, tmi.pm and tscore.pm on a corpus of text consisting of 1336151 bigrams, which is the concatenation of all text files in /home/cs/tpederse/CS8761/BROWN1. This text is available at /home/cs/aura0011/CS8761/nsp/CORPUS .
User 2: We find the following significant collocations in the top 50 ranks when executed on CORPUS: Breezy clotheslines, Peace Corps, Chemical Name, Southern California, final solution, policy trends, Agatha Christie, Hong Kong, Rhode Island, export import, testimony whereof, Electronic Circuits, Polypropylene glycol, barbarian hordes, output control, polio vaccine, freight car, armed force etc. It is observed that the u-test gives significant collocations, but also gives a lot of insignificant ones with them. The u test is dependent on the independent and joint frequency of the two words. If the two words tend to occur together always and each word in that collocation occurs only rarely in any other construct then u-test recognizes them as good candidates of collocation. But collocation made up of two more frequent words like bitten and down are ranked low as the two words making up the collocation also occur without each other quite often. These are the some of the entries in the output file created by user2.pm 1336151 ..... Rosy<>Fingered<>1 1155.9191 1 1 1 dabhumaksanigalu<>ahai<>1 1155.9191 1 1 1 Alcohol<>ingestion<>1 1155.9191 1 1 1 URETHANE<>FOAMS<>1 1155.9191 1 1 1 breezy<>clotheslines<>1 1155.9191 1 1 1 ..... MICROMETEORITE<>FLUX<>26 943.8026 2 2 3 Unifil<>loom<>27 943.8005 4 4 6 PEACE<>CORPS<>27 943.8005 4 6 4 JUNE<>18_<>27 943.8005 4 6 4 ...... DEAE<>cellulose<>30 909.4587 13 13 21 Ham<>Richert<>31 895.3684 3 3 5 Final<>Solution<>31 895.3684 3 5 3 Sheraton<>Biltmore<>31 895.3684 3 5 3 Pro<>Arte<>31 895.3684 3 3 5 T score: There are no any significant collocations in T score as it is based on frequency of occurrences. 
Some entries are:-

1336151
of<>the<>1 78.1686 9181 36043 62690
.<>The<>2 66.7814 4961 49673 6921
in<>the<>3 60.4230 5330 19581 62690
,<>and<>4 57.7736 5530 58974 27952
.<>He<>5 46.9438 2421 49673 2991
on<>the<>6 40.4487 2196 6405 62690
.<>It<>7 39.4899 1718 49673 2184
to<>be<>8 37.3632 1631 25725 6340
,<>but<>9 37.3273 1648 58974 3006
to<>the<>10 36.0291 3266 25725 62690
.<>In<>11 34.5555 1320 49673 1736
.<>I<>12 34.0372 1565 49673 5877
at<>the<>13 31.9296 1448 4966 62690
.<>But<>14 31.8652 1115 49673 1371
for<>the<>15 30.4946 1655 8833 62690
from<>the<>16 29.6802 1243 4190 62690

Mutual Information:-
We find the following significant collocations in the top 50 ranks:
United States, Rhode Island, Peace corps, Bang Jensen, Air Force, Fiscal year, nineteenth century, President Kennedy, United Nations, Civil War, New Orleans, The Editor, Kansas city, Nuclear Weapons, High School, Vice President, York city, Export Import, Supreme court, St. Louis etc.

We find that mutual information is good at finding collocations that are made of proper nouns. Some entries in the output are as follows:-

1336151
....
on<>the<>10 0.0031 2196 6405 62690
United<>States<>11 0.0029 346 456 447
had<>been<>12 0.0028 721 5096 2470
....
years<>ago<>32 0.0008 126 948 246
Rhode<>Island<>32 0.0008 77 89 129
....
ominant<>stress<>38 0.0002 28 62 108
There<>were<>38 0.0002 74 901 3279
Export<>Import<>38 0.0002 14 15 15
be<>taken<>38 0.0002 52 6340 278
.....
he<>asked<>38 0.0002 57 6799 392
nuclear<>weapons<>38 0.0002 20 110 61
But<>it<>38 0.0002 90 1371 6873

Cutoff Point:

User 2:
Scanning through the output of the user2.pm module, I find the cutoff point at a u-value of approximately 12. Below a u-value of 12 the collocations given are bigrams made up of common words.

T-Score:
T-score gives a scattered distribution of collocations, with the more frequently occurring bigrams at the top and the less frequent, but important, ones neglected.
Thus a cutoff value cannot be suggested for the t-score.

Mutual Information:
As mutual information depends on observed and expected frequencies, a cutoff is difficult to suggest.

Rank Comparison:

"True" Mutual Information:-
When rank.pl is executed on the output of "True" mutual information and the output of Pointwise Mutual Information, the value returned is -0.96. This suggests that mutual information and pointwise mutual information are not the same measure, and that their rankings become more nearly opposite as the corpus size increases: when I tried them on a smaller corpus of 700000 bigrams, rank.pl returned a value of -0.7.

When rank.pl is executed on the output of "True" mutual information and the output of the Log likelihood ratio measure, the value returned is -0.96, which suggests that mutual information and Log likelihood are also not similar. These two also differ more as the corpus size increases. Mutual information tends to remove the discrepancies of the likelihood ratio measure and pointwise mutual information, so it is a measure unlike either of them.

User 2:
When rank.pl is executed on the values returned by the u-test and pointwise mutual information, the value is 0.9641, while for the u-test and the Log likelihood ratio it is 0.93.

Pointwise mutual information, the Log likelihood ratio and the u-test are based on a calculation on the observed and expected values of the bigram, while "True" mutual information is a calculation on all the values in the contingency table and tries to analyze the corpus as a whole. Thus it is different from the other three.

Overall Recommendation:

User2 :-
This test is also based on the observed and expected values. But it depends more on the difference between the joint frequency and the independent frequencies. Thus it ranks less frequently occurring bigrams higher than those occurring more frequently.
The observations also suggest that the highly ranked bigrams are those whose joint frequency and independent frequencies do not differ. This test gives a relatively good measure of collocations, as can be seen from the output.

T-Score:-
In the following, n is the node word, c is its collocate, f(n) and f(c) are their frequencies, f(n,c) is their joint frequency and N is the size of the corpus.

In cases where n and c are not very frequent (and most words are infrequent), and where the corpus is large, f(n)f(c)/N will be a very small number. In such cases, subtracting this number from f(n,c) makes only a small difference. It follows that

T = approx f(n,c) / sqrt(f(n,c)) = sqrt(f(n,c))

Therefore the main factor in the value of T is simply the absolute frequency of joint occurrences. The T-value picks out cases where there are many joint occurrences, and therefore provides confidence that the association between n and c is genuine. But, clearly, whether we rank order things by raw frequency or by its square root makes no difference.

However, T is sensitive to an increase in the product f(n)f(c). In such cases

T = [f(n,c) - X] / sqrt(f(n,c))

where X = f(n)f(c)/N is significantly large relative to f(n,c). Since T decreases if f(n)f(c) becomes very large, the formula has a built-in correction for cases involving very common words. In practice, this correction has a large effect only with a small number of common grammatical words, especially if they are in combination with a second relatively common word.

If the corpus gets larger but f(n)f(c) stays the same, then f(n)f(c)/N decreases again, and T correspondingly increases. Thus T is larger when we have looked at a larger corpus, and we can be correspondingly more confident of our results. Again, this effect is noticeable only in cases where the node and/or the collocate are frequent. But the cases of frequent collocates also form the drawback of the test: it does not give importance to infrequent collocations.
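
The formula discussed above can be sketched in a few lines of Python. This is only an illustration: the function name t_score and its argument names are invented for this note, and it is an assumption that tscore.pm computes the same quantity. The counts are those of the of<>the row in the t-score output listed earlier.

```python
import math

def t_score(joint, left, right, total):
    """T = (f(n,c) - f(n)f(c)/N) / sqrt(f(n,c)) for one bigram.

    joint = f(n,c), left = f(n), right = f(c), total = N (number of bigrams).
    The names are illustrative and are not taken from tscore.pm.
    """
    expected = left * right / total      # f(n)f(c)/N under independence
    return (joint - expected) / math.sqrt(joint)

# Counts of the "of<>the" row in the t-score output listed earlier.
score = t_score(9181, 36043, 62690, 1336151)
print(f"{score:.4f}")
```

For a rare pair in a large corpus the expected term almost vanishes and the score is close to sqrt(f(n,c)), which is the approximation made in the text.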
Mutual Information:
The Mutual Information score expresses the extent to which the observed frequency of co-occurrence differs from the expected one (where expected means "expected under the null hypothesis"). But there is a problem with Mutual Information. Suppose a word appears just once with another word in the corpus (which is not an unreasonable event), and the first word has a very low overall frequency. Say the first word occurs only once, i.e. in a bigram with the second word. When we calculate the expected value of the bigram and the corpus size is large, we get a very small expected value (about 1/1000). So even though the first word occurs just once with the second word, the observed frequency is 1000 times the expected joint frequency, and the mutual information value will be high.

Also, when I experimented by executing the count module with a window size of 3 for bigrams, I found some interesting bigrams like "rescue prisoners". These were not present in the output with the default window size, because some two-word collocations have a constant number of other words between them. The one mentioned (rescue prisoners) is not present in the first case, as it might have occurred in a phrase like "rescue the prisoners". This experiment was executed on a smaller corpus, as the number of bigrams grows quickly as the window size increases.

--------------------------------------------------------------------------------
Experiment 2:
To implement a module named ll3.pm that performs the log likelihood ratio for 3 word sequences, and to create the 3 word version of user2.pm and name it user3.pm. A 3 word version of the u-test is implemented in user3.pm. For both of these the null hypothesis is extended from that for two words to three words, i.e. all the three words are independent.
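
Under the three-word null hypothesis stated above, the expected count of each cell of the 2x2x2 table is the product of the three single-word marginals divided by the square of the number of trigrams. The Python sketch below computes a log likelihood ratio on that basis. It is an assumption that ll3.pm uses this full-independence model; the function name ll3 and its argument names are invented here, and the arguments follow the column order of a count.pl trigram line.

```python
import math

def ll3(n111, n1pp, np1p, npp1, n11p, n1p1, np11, total):
    """Log likelihood ratio G^2 for a trigram under full independence.

    Arguments follow the order of a count.pl trigram line; total is the
    number of trigrams. The names are illustrative, not taken from ll3.pm.
    """
    # Observed counts of the eight cells (index 1 = word present, 2 = absent).
    obs = {
        (1, 1, 1): n111,
        (1, 1, 2): n11p - n111,
        (1, 2, 1): n1p1 - n111,
        (2, 1, 1): np11 - n111,
        (1, 2, 2): n1pp - n11p - n1p1 + n111,
        (2, 1, 2): np1p - n11p - np11 + n111,
        (2, 2, 1): npp1 - n1p1 - np11 + n111,
        (2, 2, 2): total - n1pp - np1p - npp1 + n11p + n1p1 + np11 - n111,
    }
    first = {1: n1pp, 2: total - n1pp}
    second = {1: np1p, 2: total - np1p}
    third = {1: npp1, 2: total - npp1}

    g2 = 0.0
    for (i, j, k), n in obs.items():
        if n > 0:  # an empty cell contributes nothing to the sum
            expected = first[i] * second[j] * third[k] / total ** 2
            g2 += n * math.log(n / expected)
    return 2.0 * g2
```

When every observed cell equals its expected value the score is zero, and any departure from independence, in either direction, makes it positive.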
Top 50 Rank:

User3:
We find the following interesting collocations in the top 50 ranked trigrams:
Getting along with, Rural Road Authority, Plant feeding facilities, Sound Tax policy, United Nations Day, long term approach, Deal with principles, state automobile practices, on unpaid taxes, peace corps volunteers, middle Atlantic states etc.

It gives good candidates to be considered as collocations, but it is not accurate and tends to rank the less frequent word combinations higher than the more frequent ones. Some of the entries are:

1336150
_FRANKFURTER<>TWISTS_<>Blend<>1 1336150.0000 1 1 1 1 1 1 1
DEFINE<>INPUT<>OUTPUT<>1 1336150.0000 1 1 1 1 1 1 1
....
BED<>PRESSED<>OR<>38 243946.4984 1 2 1 15 1 1 1
PUSH<>UPS<>BUT<>38 243946.4984 1 2 3 5 2 1 1
PLANT<>FEEDING<>FACILITIES<>38 243946.4984 1 3 2 5 1 1 1
OR<>SODIUM<>HEXAMETAPHOSPHATE<>38 243946.4984 1 15 2 1 1 1 1
....
Dried<>rumen<>bacteria<>41 236200.1814 1 2 2 8 1 1 1
broiled<>steaks<>tantalizing<>41 236200.1814 1 2 4 4 1 1 1
GETTING<>ALONG<>WITH<>41 236200.1814 1 2 1 16 1 1 1
.....
INORS<>MUST<>ALSO<>41 236200.1814 1 2 8 2 1 1 1
0895<>NATURAL<>GAS<>41 236200.1814 1 4 4 2 1 1 1
SOUND<>TAX<>POLICY<>41 236200.1814 1 1 8 4 1 1 1
Accepted<>crystallographic<>symbolism<>41 236200.1814 1 2 2 8 1 1 1

Log likelihood 3:
There are not many interesting trigrams in the top 50 rankings of the ll3.pm output.
Following are the first 15 entries in the output:

1336150
,<>.<>The<>1 27511.2605 7 58974 49673 6921 57 30 4961
of<>.<>The<>2 25527.1297 2 36043 49673 6921 25 2 4961
a<>.<>The<>3 24045.6938 1 21998 49673 6921 15 2 4961
to<>.<>The<>4 24041.1144 3 25725 49673 6921 47 5 4961
.<>The<>.<>5 24006.2920 1 49673 6921 49674 4961 335 1
in<>.<>The<>6 23023.9802 11 19581 49673 6921 81 13 4961
is<>.<>The<>7 22535.4368 2 9969 49673 6921 29 3 4961
he<>.<>The<>8 22493.5350 1 6799 49673 6921 3 3 4961
for<>.<>The<>9 22438.9843 4 8833 49673 6921 28 4 4961
was<>.<>The<>10 22413.8054 1 9770 49673 6921 37 5 4961
with<>.<>The<>11 22337.1213 2 7007 49673 6921 19 2 4961
his<>.<>The<>12 22239.1767 3 6469 49673 6921 21 5 4961
at<>.<>The<>13 22215.6165 1 4966 49673 6921 10 1 4961
from<>.<>The<>14 22191.3867 1 4190 49673 6921 5 1 4961
had<>.<>The<>15 22155.1802 2 5096 49673 6921 17 2 4961

Cutoff point:
It is difficult to spot collocations among trigrams, as there are more possible word combinations. So it is not easy to find the cutoff point.

User3:
As we skim through the output of user3.pm, we find that the cutoff point occurs somewhere around a u-test value of 2000 for the current CORPUS. The entries with a u-value above 2000 are composed of less frequent words. The entries below 2000 contain more frequent words like THE or IS, or a punctuation mark, and hence are weak candidates for collocations.

Log likelihood ratio 3:
It gives a rather scattered distribution of the collocations, and hence a cutoff cannot be suggested.

Rank Comparison:
When rank.pl is executed on the output of the Log likelihood ratio and the output of User3, we get a value of 0.4726. This suggests that the two tests are only weakly related to each other.

Overall Recommendations:
The u-test for collocations gives a good result for finding collocations that do not occur frequently.
To find out the infrequent word combination collocations we have to carefully go through the output created. By infrequent word combination collocation we mean the collocations formed by words which occur independently more often than they occur in the collocation. The Log likelihood ratio measure gives better results for trigrams that occur more often than those which occur less often. But the trigrams which occur most are made up of common articles (like The) or verbs (like Is) or punctuation marks. Thus the high ranked trigrams in the output of ll3.pm are not significant collocations. ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ CS8761 Natural Language Processing Assignment 2 Archana Bellamkonda October 11, 2002. Problem Description : -------------------- The objective is to investigate various measures of association that can be used to identify collocations in large corpora of text. Following the investigation, a measure is selected and implemented so that it can be used with 2 and 3 word sequences to find the association between the words. This measure is also compared with some other standard measures and subsequent analysis is done. [ Measure of Association is given by a value calculated from the method selected and that value determines the chance of the words in a bigram or trigaram ( or n-gram in general) to be together, i.e. we get information about mutual dependence between the words. ] Input and Output Description : ---------------------------- Our aim is to calculate a score for each bigram or trigram that determines the associativity between the words. 
In this experiment, we write modules for calculating a value based on true mutual information (tmi.pm), the log-likelihood ratio for trigrams (ll3.pm), and a new measure for bigrams (user2.pm) and trigrams (user3.pm). The program statistic.pl in the NSP package calculates these values for each bigram or trigram. We have to specify the name of the module that we are using, a file to write the results into, and an input file in the format output by count.pl in the NSP package. [count.pl finds all n-grams in a file and writes them into the file specified, along with various frequency values associated with each n-gram.] The output of count.pl is used as input for statistic.pl.

Example:
------
If input.txt is the text that we want to analyse, then we can run a simple count.pl with the following syntax:

    perl count.pl result.txt input.txt

The above syntax is for bigrams. If we need the same for trigrams, the NSP package allows us to write

    perl count.pl --ngram 3 result3.txt input.txt

[Other ways to run it are specified in the package.]

Once we get result.txt or result3.txt, containing all bigrams or trigrams along with some frequency values, we use that file as input to statistic.pl, which calculates a score for each bigram or trigram and stores the results in an output file specified by the user. We should also specify the method that is to be used to calculate the scores. To calculate scores using the method specified in module user2.pm, we can write the simple command

    perl statistic.pl user2 user2.res result.txt

For trigrams, we can run statistic.pl as

    perl statistic.pl user3 --ngram 3 user3.res result3.txt

Thus, we can compute scores that tell us about the association between the words in an n-gram and get the results in a file specified by the user.

Description of Method Used :
--------------------------

user2.pm :
--------
In user2.pm, I used Odds Ratio as a measure of associativity.
[ Found in the book Categorical Data Analysis by Alan Agresti, which I found by searching in the library ]

If X<>Y is a bigram, then the Odds Ratio for it is given by the formula

Odds Ratio = ( frequency of X and Y occurring together * frequency of neither X nor Y occurring )
             / ( frequency of X occurring and Y not occurring * frequency of Y occurring and X not occurring )

Note : Above, when we mention the occurrence of X or Y, we mean the occurrence of X in the first position and the occurrence of Y in the second position of a bigram.

The numerator holds the frequency of X and Y occurring together and the frequency of neither X nor Y occurring (for a fixed corpus size, a larger value of the latter leaves fewer bigrams in which only one of X and Y appears in its position). If these values increase then, intuitively, there is a stronger association between the words, and hence the Odds Ratio increases.

Similarly, if the frequency of X occurring in its position without Y, or of Y without X, increases, then X and Y appear apart more often and the association is lower. As these frequencies are in the denominator, the Odds Ratio decreases, and hence it makes sense.

Thus, the Odds Ratio is calculated for each bigram, and we can judge the association between the two words of a bigram from it. The greater the ratio, the greater the association between them.
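
As a small illustration of the formula above, the following Python sketch derives the four cells from the three counts that count.pl prints for a bigram. It is not the code of user2.pm: the function name odds_ratio and its argument names are made up for this note, and returning infinity for a zero denominator is a choice made here, since the report does not say how the module treats that case.

```python
def odds_ratio(n11, n1p, np1, total):
    """Odds ratio of a bigram X<>Y from count.pl-style counts.

    n11 = count of X<>Y, n1p = count of X in first position,
    np1 = count of Y in second position, total = number of bigrams.
    """
    n12 = n1p - n11                  # X first, Y not second
    n21 = np1 - n11                  # Y second, X not first
    n22 = total - n1p - np1 + n11    # neither X first nor Y second
    if n12 == 0 or n21 == 0:
        # Zero denominator: infinity is an assumption of this sketch only;
        # the handling in user2.pm is not stated in the report.
        return float("inf")
    return (n11 * n22) / (n12 * n21)
```

A bigram that occurs once, whose two words each occur twice, gets the value total - 3 from this function, so the score of such rare pairs scales with the size of the corpus.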
Example :-
Say i<>j is a bigram, and we have (first bigram in the output given by count.pl when run on Brown.txt)

nij = number of times i<>j occurred in the sample = 9181
ni+ = number of times i<>+ occurred in the sample (second word can be anything) = 36043
n+j = number of times +<>j occurred in the sample (first word can be anything) = 62690
total = total number of bigrams in the sample = 1336151

Now we get

nibarj = number of times j occurred in second place where i is not in first place = n+j - nij = 62690 - 9181 = 53509
nijbar = number of times i occurred in first place where j is not in second place = ni+ - nij = 36043 - 9181 = 26862
nibarjbar = number of times when neither i nor j occurred = total - ni+ - n+j + nij = 1336151 - 36043 - 62690 + 9181 = 1246599

Odds Ratio = nij * nibarjbar / (nijbar * nibarj) = 9181 * 1246599 / ( 26862 * 53509) = 7.9625

user3.pm :
-------
The same concept of Odds Ratio can be extended to apply to trigrams. Let X<>Y<>Z be a trigram. Mantel and Haenszel proposed the following formula for calculating the Odds Ratio for trigrams:

Odds Ratio = (sum over k of (n11k * n22k / n++k) ) / (sum over k of (n12k * n21k / n++k) )

[Page 236 in the book, Categorical Data Analysis]

where
n111 = frequency of occurrence of the trigram XYZ,
n112 = frequency of occurrence of trigrams where X,Y are in first and second positions and Z is not in third position,
n121 = frequency of occurrence of trigrams where X,Z are in first and third positions and Y is not in second position,
n122 = frequency of occurrence of trigrams where X occurs in first place and Y,Z are not in second and third positions,
. . . . and so on
n++1 = n111 + n121 + n211 + n221 and
n++2 = n112 + n122 + n212 + n222
and k = 1,2 for a trigram.

The Odds Ratio makes sense for trigrams as it does for bigrams, since an increase in the value of the numerator intuitively implies greater associativity.
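
The Mantel-Haenszel formula above can be sketched in Python as follows. The sketch assumes that the eight cells are obtained from the seven counts of a count.pl trigram line by inclusion and exclusion; the function name mantel_haenszel_or and its argument names are hypothetical and are not taken from user3.pm, and returning infinity for a zero denominator is a choice made for this sketch only.

```python
def mantel_haenszel_or(xyz, xpp, pyp, ppz, xyp, xpz, pyz, total):
    """Mantel-Haenszel odds ratio for a trigram X<>Y<>Z.

    Arguments follow the order of a count.pl trigram line. The two strata
    are k = 1 (Z in third position) and k = 2 (Z not in third position).
    """
    n111 = xyz
    n112 = xyp - xyz
    n121 = xpz - xyz
    n211 = pyz - xyz
    n122 = xpp - xyp - xpz + xyz
    n212 = pyp - xyp - pyz + xyz
    n221 = ppz - xpz - pyz + xyz
    n222 = total - xpp - pyp - ppz + xyp + xpz + pyz - xyz
    npp1 = n111 + n121 + n211 + n221     # stratum sizes n++k
    npp2 = n112 + n122 + n212 + n222

    num = n111 * n221 / npp1 + n112 * n222 / npp2
    den = n121 * n211 / npp1 + n122 * n212 / npp2
    if den == 0:
        return float("inf")   # assumption of this sketch, see lead-in
    return num / den
```

If the three words are exactly independent, the numerator and denominator coincide and the ratio is 1; values above 1 indicate that X and Y occur together more often than apart within the strata.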
Example :-
------
total = 1071112
n111 = xyz = 2179
xpp = 35891
pyp = 35891
ppz = 35891
xyp = 3751
xpz = 2760
pyz = 3751

n121 = xybarz = xpz - xyz = 2760 - 2179 = 581
n211 = xbaryz = pyz - xyz = 3751 - 2179 = 1572
n112 = xyzbar = xyp - xyz = 3751 - 2179 = 1572
n221 = xbarybarz = ppz - n121 - n211 - xyz = 35891 - 581 - 1572 - 2179 = 31559
n122 = xybarzbar = xpp - n112 - n121 - xyz = 35891 - 1572 - 581 - 2179 = 31559
n212 = xbaryzbar = pyp - n112 - n211 - xyz = 35891 - 1572 - 1572 - 2179 = 30568
n222 = total - xpp - pyp - ppz + xyp + xpz + pyz - xyz
     = 1071112 - 35891 - 35891 - 35891 + 3751 + 2760 + 3751 - 2179 = 971522
npp1 = n111 + n121 + n211 + n221 = 2179 + 581 + 1572 + 31559 = 35891
npp2 = n112 + n122 + n212 + n222 = 1572 + 31559 + 30568 + 971522 = 1035221

odds ratio = (npp2.n111.n221 + npp1.n112.n222) / (npp2.n121.n211 + npp1.n122.n212)

EXPERIMENT 1 :
------------
Once user2.pm is implemented, it is run with NSP to perform analysis on the following.

TOP 50 COMPARISON :
------------------

--> Number of Ranks :
---------------
On observing the ranks produced by user2.pm and tmi.pm, and considering the top 50 ranks, it is observed that there are only 40 ranks for the text I gave when we use tmi.pm. So I could not even observe 50 ranks, even though the text consists of more than a million tokens. But there are 131972 ranks when the user2 module is used.

Analysis : The reason for getting fewer ranks is that the score is zero for most of the bigrams. A score of zero could mean that (observed frequency of the 2 words being together) / (expected value for those words being together) is negative, because only then is the value not calculated in the logic of the program. [The log is calculated only for values greater than 0.] There is no chance that the expected value is negative. [It is the product of the left and right frequencies of a bigram divided by the total number of bigrams; all these values are taken from count.pl and they are never negative.]
So, the only possibility is that the observed frequency of the 2 words being together is zero, as it is never negative. This implies that if the value (observed frequency of the 2 words being together) / (expected value for those words being together) is valid, then the only way to get zero is for the observed frequency to equal the expected frequency. [because log 1 = 0, and that is the only way we can get zero]

NOTE:
------
The formula used for tmi is:

score = sum over all i and j of ( (nij / n++) * log ( nij / eij) )

where the log is to base 2.

Conclusion for this Analysis : Using tmi, we observe that the observed value is equal to the expected value in a relatively large number of cases, which means that the words in those bigrams are independent. Thus, the Null Hypothesis can be accepted in this case.

--> Comparison between user2 and tmi :
---------------------------------
I observed that the bigrams that are ranked high using tmi are ranked low using user2. If we are considering a particular text and performing analysis on it, I feel that user2 would be more interesting than tmi as it gives wider rankings, so that we can have a better intuition regarding the words of that particular text.

In my experiment, I observed that using user2, bigrams that are some sort of names or common phrases (at least in that text) are ranked high. Using tmi, I observed that bigrams that are more common in the language in general, and not in a particular text, are ranked high, for example ".<>The<>" (ranked 1).

Hence, I feel that user2 would be significantly better than tmi for the analysis of texts of a particular author, or a particular category, etc., but tmi would be better than user2 for performing analysis on a language rather than on a specific text.

CUTOFF POINT :
-----------

tmi.pm :
For scores generated using the tmi module, 0.0135 is the score for the first rank in my experiment. The score goes on decreasing as the rank increases, and it remains constant once it reaches zero. We can neglect all the bigrams whose score is zero.
This means that the words in the bigrams with score zero have no dependence or association, and hence we can neglect them. Hence the cutoff point will be the point where we encounter zero as the value of the measure.

user2.pm :
For scores generated using the user2 module, 1336150 is the score for the first rank in my experiment. From there on, the score goes on decreasing as the rank increases, but I did not observe any cutoff point beyond which I can neglect the values. I did not find a cutoff as the scores are not normalized or kept within a certain boundary. When there are boundaries, we can find cutoff values at the boundaries. But here, as we lack specific boundaries, we are unable to determine the cutoff point.

This tells us that this test is not suitable for all texts. The reason is that, as the size of the corpus increases, the value of the odds ratio goes on increasing, and this is not very desirable. Especially when a bigram occurs only once in a very large corpus, the value of (frequency of occurrences without X and Y) in the numerator increases considerably and we do not have a way to normalize it. Thus, we can say that the Odds Ratio test is suitable only for texts of reasonable size with acceptable frequency values for the occurrences of bigrams.

Also, we can conclude that we did not find a cutoff using user2.pm because the score varies with size. When a score varies with size, we can generally define a cutoff as a function of size, so that the cutoff changes with size. But here it is not clear what the relation is between the size of the text and the Odds Ratio. Hence, I am unable to determine a cutoff for this experiment.
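
For the tmi.pm scores discussed in this section, the formula given in the NOTE above can be sketched in Python as follows. The function name tmi and its argument names are invented for this note and are not taken from tmi.pm; the first call uses the counts of ".<>The", the first rank in the tmi output, and the second uses a hypothetical bigram whose two words each occur only once.

```python
import math

def tmi(n11, n1p, np1, total):
    """True mutual information of a bigram, in bits (log base 2).

    Sums (nij / n++) * log2(nij / eij) over the four cells of the 2x2 table.
    """
    rows = (n1p, total - n1p)
    cols = (np1, total - np1)
    obs = (
        (n11, n1p - n11),
        (np1 - n11, total - n1p - np1 + n11),
    )
    score = 0.0
    for i in range(2):
        for j in range(2):
            n = obs[i][j]
            if n > 0:  # the log is only taken for non-zero cells
                expected = rows[i] * cols[j] / total
                score += (n / total) * math.log2(n / expected)
    return score

top = tmi(4961, 49673, 6921, 1336151)   # ".<>The", first rank in the tmi output
once = tmi(1, 1, 1, 1336151)            # hypothetical bigram of two words seen once
print(f"{top:.4f} {once:.4f}")
```

Under these assumptions, a bigram seen once and made of words seen once gets a score of roughly log2(N)/N, about 0.00002 for this corpus. Such a score is positive but is shown as 0.0000 at four decimal places, which may account for part of the long tail of zero scores.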
RANK COMPARISON :
----------------

Comparison with ll.pm :-
The rank correlation coefficient on comparing ll.pm with tmi.pm is -0.9668
The rank correlation coefficient on comparing ll.pm with user2.pm is 0.8349

Comparison with mi.pm :-
The rank correlation coefficient on comparing mi.pm with tmi.pm is -0.9668
The rank correlation coefficient on comparing mi.pm with user2.pm is 1.0000

Observing the rank correlation coefficients above, we can conclude that user2 gives exactly the same rankings as mi.pm, which means that there is high relatedness between mi.pm and user2.pm. Also, user2.pm has high relatedness to ll.pm. Thus we can tend to infer that user2 is a good measure of association, as it correlates well with the existing measures. This is true as long as the drawback observed when determining the cutoff point above is not significant.

--> Related but Negative value :-
--------------------------
One interesting observation is that comparing with tmi.pm gives us a negative coefficient. But if we observe the actual data in the results of both modules, they have very similar rankings.
For example, consider the first 10 rankings given by ll.pm and tmi.pm :

ll.pm:
----
1336151
.<>The<>1 25043.4196 4961 49673 6921
of<>the<>2 18856.1038 9181 36043 62690
.<>He<>3 13183.4486 2421 49673 2991
in<>the<>4 11393.6840 5330 19581 62690
.<>It<>5 9139.3156 1718 49673 2184
,<>and<>6 9072.9347 5530 58974 27952
.<>In<>7 6844.0270 1320 49673 1736
,<>but<>8 6309.5959 1648 58974 3006
the<>the<>9 6126.7892 3 62690 62690
.<>But<>10 6064.5057 1115 49673 1371

tmi.pm :
-----
1336151
.<>The<>1 0.0135 4961 49673 6921
of<>the<>2 0.0102 9181 36043 62690
.<>He<>3 0.0071 2421 49673 2991
in<>the<>4 0.0062 5330 19581 62690
.<>It<>5 0.0049 1718 49673 2184
,<>and<>5 0.0049 5530 58974 27952
.<>In<>6 0.0037 1320 49673 1736
,<>but<>7 0.0034 1648 58974 3006
.<>But<>8 0.0033 1115 49673 1371
the<>the<>8 0.0033 3 62690 62690
to<>be<>9 0.0032 1631 25725 6340
on<>the<>10 0.0031 2196 6405 62690

So it is clear that the two measures must be highly related, as the rankings are nearly the same. That means we would expect a value around +1. But we observed a value around -1 (observed value = -0.9668). This is an interesting observation.

The reason for this is that tmi is normalized, so we could observe a cutoff at 0, and all the bigrams thereafter can be neglected as all of them have the same rank (only 40 ranks are observed in this experiment). But ll and mi do not give the same rank to so many bigrams; moreover, their scores depend on the size of the corpus and we get many rankings (for example, 106744 rankings are observed in ll). Since the difference in rankings is the criterion in calculating the correlation coefficient, these differences produce a negative value. Therefore, though these methods seem to be highly correlated, they actually end up being highly inversely correlated.
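
The effect described above can be reproduced with a toy example. The Python sketch below applies Spearman's formula to two rank lists that agree perfectly on the top ten items, where the second list then gives one shared rank to everything else. It assumes that the rank numbers are used as they are printed (tied scores share one rank number); whether rank.pl treats ties in exactly this way is an assumption, and the function name spearman_from_ranks is made up for this note.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's coefficient 1 - 6*sum(d^2) / (n*(n^2 - 1)) from two rank lists."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

n = 1000
# Measure A: every bigram has its own rank, 1..n.
ranks_a = list(range(1, n + 1))
# Measure B: same order for the top 10, then one shared rank for all the rest,
# mimicking a long tail of identical (zero) scores.
ranks_b = [min(r, 11) for r in ranks_a]

rho = spearman_from_ranks(ranks_a, ranks_b)
print(f"{rho:.4f}")
```

In this toy case the coefficient is close to -1 although the two measures never disagree about the order of any two items, which matches the contrast between 40 ranks for tmi and 106744 ranks for ll.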
OVERALL RECOMMENDATION :
----------------------
Based on all the observations made, I think the best method for identifying significant collocations in a large corpus among mi.pm, ll.pm, user2.pm and tmi.pm would be "mi.pm" or "ll.pm".

user2.pm might not be appropriate in cases where most bigrams do not occur more than once, or where the size of the corpus is very large, for the reasons explained earlier.

tmi.pm would also be inappropriate compared to mi.pm, as it quickly converges to the cutoff, which excludes bigrams with scores equal to the cutoff from consideration.

EXPERIMENT 2 :
------------
Once user3.pm is implemented, the following are done.

TOP 50 COMPARISON :-
------------------
On observing the top 50 trigrams after running NSP on a corpus using ll3.pm and user3.pm, I see that the two methods are significantly different from each other, as the trigrams ranked in the top 50 using ll3.pm and the trigrams ranked in the top 50 using user3.pm are significantly different.

Also, it is observed that the scores are relatively high using user3 compared to the scores using ll3. This could indicate that user3.pm is more affected by the size of the text.

CUTOFF POINT :-
------------
No cutoff points are observed for either of the methods, ll3.pm or user3.pm, for reasons similar to the bigram case.

OVERALL RECOMMENDATION :-
---------------------
Among ll3.pm and user3.pm, the better method would be ll3.pm, as we get very high ratios using user3.pm as the size of the corpus increases. user3 degrades especially when many trigrams occur only once, which causes the ratio to increase considerably, as is the case with user2.pm for bigrams.
++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ #------------------------[experiments.txt]---------------------------- # Assignment # 2 : CS-8761 Corpus Based NLP # Title : Report for "The Great Collocation Hunt" # Author : Deodatta Bhoite (bhoi0001@d.umn.edu) # Version : 1.0 # Date : 10-10-2002 #--------------------------------------------------------------------- Introduction ------------ "You shall know a word by the company it keeps!" - J.R.Firth The most trivial way of finding out collocations in a text would be counting the number of joint occurrences of two words. However, it is seen that we can find more significant collocations by applying various tests of association to the frequency data (contingency table) of the corpus. In this assignment we investigate various tests of association to identify 2-word and 3-word collocations. Experiment 1 ------------ Corpus for this experiment has been taken from Project Gutenberg and includes the following files: Alice in Wonderland (alice.txt) Black Beauty (blackbeauty.txt) Robinson Crusoe (crusoe1.txt) Further Adventures of Robinson Crusoe (crusoe2.txt) Dracula (dracula.txt) King James New and Old Testament (bible.txt) Vikram and the Vampire (vikram.txt) The total token size of the corpus is around 1.37M tokens. However, I remove the digits and the punctuation from the data by specifying a token file to the count.pl. The digits and punctuation are filtered by the following regular expression. 
--[token.txt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /[a-zA-Z]+/ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We then count find the bigrams and frequency data related to them by the following command: % count.pl --token token.txt test2.cnt /home/cs/bhoi0001/CS8761/nsp/ The top 50 collocations based on frequency counts are as follows. We observe that they are not significant, which is probably because they do not take into consideration the independent occurrence of the words forming the bigrams. --[test2.cnt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 of<>the<>14731 48646 89427 in<>the<>6985 20407 89427 the<>LORD<>5964 89427 6653 and<>the<>5282 59758 89427 to<>the<>3830 30166 89427 all<>the<>2661 8710 89427 shall<>be<>2551 10449 10176 And<>the<>2249 13476 89427 I<>will<>2044 23430 4922 for<>the<>2028 12188 89427 unto<>the<>2020 8944 89427 out<>of<>1953 4357 48646 of<>Israel<>1701 48646 2578 and<>I<>1700 59758 23430 from<>the<>1674 5455 89427 that<>I<>1665 20250 23430 the<>king<>1662 89427 2695 said<>unto<>1643 6088 8944 on<>the<>1643 4949 89427 And<>he<>1637 13476 15266 with<>the<>1624 10543 89427 of<>his<>1594 48646 12329 I<>have<>1575 23430 6484 into<>the<>1565 3251 89427 to<>be<>1564 30166 10176 by<>the<>1520 4865 89427 and<>he<>1494 59758 15266 I<>had<>1445 23430 6896 that<>he<>1420 20250 15266 son<>of<>1418 2305 48646 children<>of<>1403 1936 48646 it<>was<>1400 12324 11852 the<>children<>1346 89427 1936 and<>they<>1337 59758 10570 the<>house<>1325 89427 2368 the<>son<>1322 89427 2305 the<>land<>1297 89427 1885 upon<>the<>1274 3952 89427 the<>people<>1266 89427 2470 I<>was<>1262 23430 11852 I<>am<>1255 23430 1430 that<>the<>1227 20250 89427 at<>the<>1178 4608 89427 of<>them<>1134 48646 9649 unto<>him<>1126 8944 9642 in<>a<>1120 20407 18937 him<>and<>1107 9642 59758 and<>his<>1101 59758 12329 and<>to<>1090 59758 30166 came<>to<>1080 3349 30166 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I have implemented "true" mutual information and Poisson's collocation measure for finding 
collocations in the corpus. Description of their implementations can be found in the source code comments. We will see the results of the association measures (AM) and their analysis in the following sections. (a) Top 50 Comparison - - - - - - - - - The two word collocations for True mutual information are found as follows: % statistic.pl tmi test2.tmi test2.cnt There are only 44 ranks in the TMI output. So we show top 50 trigrams as they are ordered in the output. --test2.tmi-- //TMI output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 the<>LORD<>1 0.0152 5964 89427 6653 of<>the<>2 0.0143 14731 48646 89427 shall<>be<>3 0.0075 2551 10449 10176 in<>the<>4 0.0074 6985 20407 89427 the<>the<>5 0.0067 2 89427 89427 I<>will<>6 0.0054 2044 23430 4922 said<>unto<>7 0.0052 1643 6088 8944 thou<>shalt<>8 0.0051 1024 5017 1627 I<>am<>9 0.0049 1255 23430 1430 of<>Israel<>10 0.0043 1701 48646 2578 out<>of<>11 0.0039 1953 4357 48646 children<>of<>12 0.0038 1403 1936 48646 son<>of<>13 0.0034 1418 2305 48646 thou<>hast<>14 0.0033 670 5017 1054 Van<>Helsing<>15 0.0031 305 305 305 I<>have<>15 0.0031 1575 23430 6484 the<>king<>16 0.0030 1662 89427 2695 and<>and<>17 0.0029 2 59758 59758 according<>to<>18 0.0027 734 799 30166 the<>children<>18 0.0027 1346 89427 1936 And<>he<>18 0.0027 1637 13476 15266 it<>was<>19 0.0026 1400 12324 11852 I<>had<>19 0.0026 1445 23430 6896 the<>land<>19 0.0026 1297 89427 1885 all<>the<>20 0.0025 2661 8710 89427 unto<>him<>20 0.0025 1126 8944 9642 had<>been<>21 0.0024 627 6896 1579 to<>pass<>21 0.0024 719 30166 906 into<>the<>22 0.0023 1565 3251 89427 of<>and<>23 0.0022 36 48646 59758 Lord<>GOD<>23 0.0022 290 1169 300 the<>son<>23 0.0022 1322 89427 2305 they<>were<>23 0.0022 921 10570 5349 came<>to<>23 0.0022 1080 3349 30166 the<>to<>23 0.0022 1 89427 30166 the<>house<>23 0.0022 1325 89427 2368 the<>earth<>24 0.0020 910 89427 1142 I<>could<>24 0.0020 782 23430 1969 he<>said<>24 0.0020 1022 15266 6088 unto<>them<>25 0.0019 968 8944 9649 could<>not<>25 
0.0019 609 1969 10847 the<>people<>25 0.0019 1266 89427 2470 of<>of<>25 0.0019 9 48646 48646 to<>be<>25 0.0019 1564 30166 10176 Thus<>saith<>25 0.0019 305 514 1262 began<>to<>26 0.0018 542 736 30166 house<>of<>26 0.0018 978 2368 48646 saith<>the<>26 0.0018 891 1262 89427 a<>little<>26 0.0018 588 18937 1189 don<>t<>26 0.0018 219 219 667 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The two word collocations for Poisson's AM are found as follows: % statistic.pl user2 test2.u2 test2.cnt The 50 ranks are as follows: --test2.u2-- //Poisson's AM output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 the<>the<>1 5888.7006 2 89427 89427 and<>and<>2 2621.8905 2 59758 59758 the<>to<>3 1984.4361 1 89427 30166 of<>and<>4 1966.1513 36 48646 59758 of<>of<>5 1693.0572 9 48646 48646 the<>I<>6 1539.8723 1 89427 23430 I<>the<>7 1269.7665 68 23430 89427 a<>the<>8 1243.3868 1 18937 89427 to<>and<>9 1078.9410 63 30166 59758 to<>of<>10 1076.6269 1 30166 48646 of<>to<>11 1048.2659 6 48646 30166 the<>he<>12 973.1853 6 89427 15266 Project<>Gutenberg<>13 910.1206 113 163 113 he<>the<>14 904.4241 22 15266 89427 his<>the<>15 807.4520 1 12329 89427 that<>and<>16 803.8059 19 20250 59758 I<>and<>17 803.2322 61 23430 59758 I<>of<>18 798.3679 8 23430 48646 of<>I<>19 764.5523 16 48646 23430 years<>old<>20 761.5457 161 752 965 Holy<>Ghost<>21 739.8175 91 152 91 lifted<>up<>22 716.3992 155 196 3976 the<>not<>23 709.7152 1 89427 10847 on<>board<>24 696.6052 157 4949 192 meat<>offering<>25 681.2251 122 322 730 Madam<>Mina<>26 667.7936 86 92 205 thus<>saith<>27 666.3556 139 467 1262 to<>to<>28 654.2245 3 30166 30166 of<>in<>29 650.7003 18 48646 20407 they<>the<>30 643.4693 11 10570 89427 in<>of<>31 636.3871 22 20407 48646 o<>clock<>32 626.1584 75 95 97 sin<>offering<>33 614.3277 118 455 730 it<>the<>34 582.5441 67 12324 89427 place<>where<>35 582.2297 147 1317 1092 thy<>servant<>36 581.0534 165 4529 554 the<>all<>37 568.8163 1 89427 8710 high<>places<>38 563.4935 98 545 295 thine<>hand<>39 541.2238 
145 923 1935 of<>he<>40 536.4604 2 48646 15266 rose<>up<>41 535.2729 123 205 3976 my<>lord<>42 531.6304 154 8795 286 at<>least<>43 525.9582 130 4608 254 pass<>when<>44 523.7818 167 906 4141 burnt<>offerings<>45 520.0814 86 388 271 Dr<>Seward<>46 513.3245 64 139 79 Mrs<>Harker<>47 511.0650 65 102 128 Mock<>Turtle<>48 509.2428 56 56 59 thou<>mayest<>49 497.1772 109 5017 117 their<>fathers<>50 489.2345 153 5767 561
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We find that the collocations in the Poisson's AM output are more interesting than those in the TMI output, e.g. Holy Ghost, Mock Turtle, Madam Mina, thy servant, my lord, high places, etc. However, the top ranks of Poisson's AM also contain bigrams like "the the" and "and and". This is due to the high value of 'lambda' (the Poisson mean), which is set to the expected count E11: for two very frequent words E11 is large, so an observed count of 1 or 2 deviates strongly from it.

(b) Cutoff Point
- - - - - - -

The cutoff for the TMI output is where rank 38 starts (the 221st bigram). There are only 44 ranks in the output. The first 221 bigrams span 38 of them (ranks change quickly), whereas the huge number of remaining bigrams share just 6 distinct ranks (ranks change slowly), i.e. there is a clustering of scores.

pass<>when<>38 0.0006 167 906 4141

There is no particular cutoff point in the output of Poisson's AM, but we could call the 1st rank a cutoff point, essentially because its score is unusually high.

(c) Rank Comparison
- - - - - - - -

The rank comparison matrix for the measures is shown below.

          TMI       USER2
   LL   -0.9046    0.9773
   MI   -0.9055    0.6924

We observe that the TMI ranking is significantly different from (in fact nearly opposite to) the LL and MI rankings. Poisson's AM is similar to log-likelihood [Krenn 00], hence the high rank coefficient (0.9773) is justified. Poisson's AM and MI are positively correlated (0.6924) but far from identical. We also observe that although the rank coefficient of TMI and LL indicates an inverse relation, their top 30 bigrams are almost the same.
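
As a rough illustration of what a rank correlation coefficient like the ones above measures, here is a minimal sketch of Spearman's formula for two rankings without ties. It is an assumption that this matches what rank.pl reports; in particular, the real output files contain many tied ranks, which this simplified version does not handle. The function name `spearman` and the toy rankings are hypothetical.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two untied rankings.

    rank_a, rank_b: dicts mapping the same items to ranks 1..n.
    Ties are NOT handled (simplifying assumption).
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences over all items.
    d2 = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy rankings (hypothetical, not taken from the experiment output).
same = spearman({"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 2, "c": 3})
reversed_order = spearman({"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 1})
```

A value near +1 means the two measures order the bigrams almost identically, a value near -1 means one ordering is close to the reverse of the other, and a value near 0 means the orderings are unrelated.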
(d) Overall Recommendation
- - - - - - - - - - - -

I found that Pointwise mutual information gives very good results. Most collocations (in the natural language sense, not statistically proven) would probably occur only once, with their words occurring only together. Such usage should be highly ranked. For example:

menace<>Monster<>1 20.3690 1 1 1
blending<>contradictions<>1 20.3690 1 1 1
fleeting<>diorama<>1 20.3690 1 1 1

Each of these bigrams occurs exactly once and neither of its words occurs anywhere else, so the score 20.3690 is simply lg(1354222), the largest value the measure can take on this corpus.

Hence, I would recommend Pointwise mutual information despite its drawbacks. My judgment is based on the results of my experiments.

Experiment 2
------------

The corpus used for the second experiment is the same as that used for the first experiment. However, in this experiment we find the trigrams (or rather 3-word collocations) in the corpus. The data is preprocessed using the same token file to remove the non-alpha characters.

Note: Before you run the following command you will probably need to raise the
----- limits for maximum file size and/or cpu time. This can be done with
      ulimit -f and ulimit -t respectively. Specify an arbitrarily high value
      or make it unlimited.

% count.pl --ngram 3 --token token.txt test3.cnt /home/cs/bhoi0001/CS8761/nsp/

The top 50 joint frequency rankings are shown below.
--[test3.cnt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 of<>the<>LORD<>1626 48646 89427 6653 14731 1630 5964 the<>son<>of<>1300 89427 2305 48646 1322 25368 1418 the<>children<>of<>1263 89427 1936 48646 1346 25368 1403 out<>of<>the<>979 4357 48646 89427 1953 1290 14731 the<>house<>of<>901 89427 2368 48646 1325 25368 978 children<>of<>Israel<>650 1936 48646 2578 1403 650 1701 the<>land<>of<>618 89427 1885 48646 1297 25368 646 saith<>the<>LORD<>615 1262 89427 6653 891 615 5964 the<>sons<>of<>504 89427 1110 48646 517 25368 598 unto<>the<>LORD<>488 8944 89427 6653 2020 489 5964 the<>LORD<>and<>484 89427 6653 59758 5964 8114 520 it<>came<>to<>464 12324 3349 30166 501 816 1080 came<>to<>pass<>462 3349 30166 906 1080 462 719 and<>all<>the<>461 59758 8710 89427 1037 3875 2661 and<>I<>will<>456 59758 23430 4922 1700 633 2044 said<>unto<>him<>454 6088 8944 9642 1643 504 1126 And<>he<>said<>452 13476 15266 6088 1637 1202 1022 the<>king<>of<>427 89427 2695 48646 1662 25368 851 And<>the<>LORD<>408 13476 89427 6653 2249 409 5964 the<>hand<>of<>406 89427 1935 48646 453 25368 449 of<>the<>house<>393 48646 89427 2368 14731 463 1325 And<>it<>came<>384 13476 12324 3349 622 533 501 of<>the<>children<>364 48646 89427 1936 14731 401 1346 said<>unto<>them<>364 6088 8944 9649 1643 383 968 the<>midst<>of<>334 89427 384 48646 383 25368 334 the<>word<>of<>331 89427 952 48646 448 25368 375 the<>name<>of<>329 89427 1105 48646 350 25368 340 it<>shall<>be<>326 12324 10449 10176 614 667 2551 in<>the<>land<>325 20407 89427 1885 6985 381 1297 according<>to<>the<>323 799 30166 89427 734 341 3830 and<>they<>shall<>320 59758 10570 10449 1337 1217 894 of<>the<>land<>319 48646 89427 1885 14731 376 1297 and<>said<>unto<>314 59758 6088 8944 1043 641 1643 of<>the<>earth<>308 48646 89427 1142 14731 314 910 LORD<>thy<>God<>296 6653 4529 4614 304 675 351 the<>LORD<>thy<>296 89427 6653 4529 5964 319 304 Thus<>saith<>the<>291 514 1262 89427 305 313 891 the<>king<>s<>288 89427 2695 3699 
1662 977 300 all<>the<>people<>287 8710 89427 2470 2661 341 1266 house<>of<>the<>287 2368 48646 89427 978 436 14731 and<>in<>the<>286 59758 20407 89427 839 3875 6985 he<>said<>unto<>283 15266 6088 8944 1022 466 1643 the<>LORD<>And<>277 89427 6653 13476 5964 1943 281 one<>of<>the<>270 3676 48646 89427 651 394 14731 word<>of<>the<>268 952 48646 89427 375 310 14731 in<>the<>midst<>267 20407 89427 384 6985 268 383 before<>the<>LORD<>265 2676 89427 6653 742 265 5964 the<>Lord<>GOD<>262 89427 1169 300 731 264 290 LORD<>of<>hosts<>244 6653 48646 300 263 244 285 the<>LORD<>of<>240 89427 6653 48646 5964 25368 263 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We cannot say by looking at the results of the frequency rankings that the collocations found are not significant. I have implemented the log-likelihood AM and Poisson's AM for trigrams. The description of their implementations can be found in their respective source codes. In the subsequent sections, we note the results of these two AMs on the corpus and compare and analyze the findings. 
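
To make the rows of the count file above easier to read, here is a small sketch that splits one trigram row into its words and counts and computes the count one would expect if the three words were independent. The column order (joint count, then the three single-word counts, then the three pairwise counts) is an assumption, though it is consistent with the bigram counts listed in Experiment 1 (for example of<>the 14731 and the<>LORD 5964). The function names are hypothetical and this is not the code of ll3.pm or user3.pm.

```python
def parse_trigram_row(row):
    """Split a count.pl trigram row into (words, counts).

    Assumed column order after the words:
    n111, n1pp, np1p, npp1, n11p, n1p1, np11
    """
    *words, numbers = row.split("<>")
    names = ["n111", "n1pp", "np1p", "npp1", "n11p", "n1p1", "np11"]
    values = [int(x) for x in numbers.split()]
    if len(values) != len(names):
        raise ValueError("expected 7 counts for a trigram row")
    return words, dict(zip(names, values))


def expected_count(counts, total):
    """Expected trigram count under full independence of the three words."""
    return counts["n1pp"] * counts["np1p"] * counts["npp1"] / total ** 2


TOTAL = 1354215  # total trigram count from the first line of test3.cnt
words, counts = parse_trigram_row("of<>the<>LORD<>1626 48646 89427 6653 14731 1630 5964")
expected = expected_count(counts, TOTAL)
ratio = counts["n111"] / expected  # observed count relative to chance
```

For this row the observed count is far above the expected count, which is the kind of evidence both association measures build on, each in its own way.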
(a) Top 50 Comparison - - - - - - - - - The 3-word collocations are found using log likelihood for trigrams as follows: % statistic.pl --ngram 3 ll3 test3.ll3 test3.cnt The top 50 3-word collocations found using log likelihood AM are as shown: --[test3.ll3]-- // Log likelihood output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 the<>LORD<>of<>1 111362.6616 240 89427 6653 48646 5964 25368 263 the<>children<>of<>2 88547.8405 1263 89427 1936 48646 1346 25368 1403 the<>king<>of<>3 88004.4561 427 89427 2695 48646 1662 25368 851 the<>son<>of<>4 87925.7509 1300 89427 2305 48646 1322 25368 1418 the<>house<>of<>5 85487.1624 901 89427 2368 48646 1325 25368 978 the<>land<>of<>6 85410.9035 618 89427 1885 48646 1297 25368 646 the<>earth<>of<>7 84784.2917 3 89427 1142 48646 910 25368 3 the<>people<>of<>8 84239.8952 139 89427 2470 48646 1266 25368 175 the<>same<>of<>9 83991.7332 1 89427 723 48646 662 25368 1 the<>Lord<>of<>10 83229.8499 15 89427 1169 48646 731 25368 25 the<>sons<>of<>11 83141.9206 504 89427 1110 48646 517 25368 598 the<>midst<>of<>12 83020.0501 334 89427 384 48646 383 25368 334 the<>world<>of<>13 82577.2817 7 89427 595 48646 468 25368 16 the<>one<>of<>14 82412.0136 4 89427 3676 48646 141 25368 651 the<>city<>of<>15 82330.9315 115 89427 977 48646 575 25368 158 the<>part<>of<>16 82258.6409 16 89427 574 48646 20 25368 341 the<>sea<>of<>17 82238.1683 20 89427 723 48646 474 25368 28 the<>priest<>of<>18 82170.8559 9 89427 576 48646 414 25368 18 the<>other<>of<>19 82114.2863 5 89427 1397 48646 587 25368 20 the<>first<>of<>20 82101.3345 14 89427 1185 48646 555 25368 29 +the<>word<>of<>21 82025.5217 331 89427 952 48646 448 25368 375 the<>congregation<>of<>22 81959.0340 73 89427 364 48646 328 25368 82 the<>ship<>of<>23 81938.7633 3 89427 617 48646 390 25368 9 the<>tabernacle<>of<>24 81811.1994 174 89427 327 48646 286 25368 175 the<>door<>of<>25 81774.1339 108 89427 538 48646 375 25368 116 the<>hand<>of<>26 81690.8952 406 89427 1935 48646 
453 25368 449 the<>end<>of<>27 81681.8431 147 89427 545 48646 260 25368 260 the<>name<>of<>28 81680.6279 329 89427 1105 48646 350 25368 340 the<>ground<>of<>29 81672.5320 2 89427 442 48646 304 25368 5 the<>Levites<>of<>30 81664.8878 3 89427 265 48646 242 25368 3 the<>wilderness<>of<>31 81662.8149 53 89427 314 48646 277 25368 54 the<>tribe<>of<>32 81648.1790 176 89427 242 48646 179 25368 207 the<>ark<>of<>33 81629.7452 139 89427 231 48646 219 25368 147 the<>inhabitants<>of<>34 81537.1688 157 89427 228 48646 187 25368 178 the<>Son<>of<>35 81468.1143 127 89427 293 48646 162 25368 203 the<>altar<>of<>36 81467.6858 58 89427 379 48646 277 25368 67 the<>way<>of<>37 81452.9843 169 89427 1463 48646 511 25368 229 the<>priests<>of<>38 81433.5951 11 89427 420 48646 272 25368 17 the<>whole<>of<>39 81424.3040 12 89427 478 48646 287 25368 16 the<>law<>of<>40 81419.8516 97 89427 588 48646 333 25368 111 the<>Jews<>of<>41 81417.7290 3 89427 253 48646 212 25368 3 the<>sword<>of<>42 81402.8199 24 89427 471 48646 288 25368 31 the<>morning<>of<>43 81366.9030 1 89427 505 48646 274 25368 2 the<>chief<>of<>44 81365.9421 70 89427 319 48646 236 25368 90 the<>field<>of<>45 81361.1708 27 89427 332 48646 243 25368 32 the<>top<>of<>46 81353.0009 132 89427 177 48646 163 25368 133 the<>wicked<>of<>47 81351.9657 5 89427 395 48646 249 25368 5 the<>rest<>of<>48 81347.0400 157 89427 587 48646 310 25368 164 the<>island<>of<>49 81314.2605 7 89427 301 48646 219 25368 10 the<>sight<>of<>50 81294.1331 187 89427 504 48646 207 25368 226 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 3-word collocations are found using Poisson's AM as follows: % statistic.pl --ngram 3 user3 test3.u3 test3.cnt The top 50 3-word collocations found using Poisson's AM are as shown: --test.u3-- //user3 measure output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 of<>the<>LORD<>1 2955.2691 1626 48646 89427 6653 14731 1630 5964 the<>children<>of<>2 2915.5269 1263 89427 1936 
48646 1346 25368 1403 the<>son<>of<>3 2906.3098 1300 89427 2305 48646 1322 25368 1418 children<>of<>Israel<>4 2437.1633 650 1936 48646 2578 1403 650 1701 saith<>the<>LORD<>5 1941.7497 615 1262 89427 6653 891 615 5964 came<>to<>pass<>6 1878.7467 462 3349 30166 906 1080 462 719 the<>house<>of<>7 1836.9888 901 89427 2368 48646 1325 25368 978 out<>of<>the<>8 1738.1924 979 4357 48646 89427 1953 1290 14731 said<>unto<>him<>9 1445.7324 454 6088 8944 9642 1643 504 1126 it<>came<>to<>10 1282.3006 464 12324 3349 30166 501 816 1080 And<>he<>said<>11 1241.8753 452 13476 15266 6088 1637 1202 1022 the<>land<>of<>12 1213.9892 618 89427 1885 48646 1297 25368 646 Thus<>saith<>the<>13 1182.4689 291 514 1262 89427 305 313 891 And<>it<>came<>14 1179.5944 384 13476 12324 3349 622 533 501 the<>Lord<>GOD<>15 1131.4398 262 89427 1169 300 731 264 290 said<>unto<>them<>16 1118.7897 364 6088 8944 9649 1643 383 968 LORD<>thy<>God<>17 1075.9444 296 6653 4529 4614 304 675 351 the<>sons<>of<>18 1072.1146 504 89427 1110 48646 517 25368 598 unto<>the<>LORD<>19 1006.5100 488 8944 89427 6653 2020 489 5964 LORD<>of<>hosts<>20 907.1557 244 6653 48646 300 263 244 285 land<>of<>Egypt<>21 897.9751 227 1885 48646 612 646 227 422 +and<>I<>will<>22 866.0937 456 59758 23430 4922 1700 633 2044 saith<>the<>Lord<>23 849.4530 239 1262 89427 1169 891 241 731 it<>shall<>be<>24 835.0506 326 12324 10449 10176 614 667 2551 the<>midst<>of<>25 819.0449 334 89427 384 48646 383 25368 334 the<>king<>s<>26 775.3358 288 89427 2695 3699 1662 977 300 he<>said<>unto<>27 769.2952 283 15266 6088 8944 1022 466 1643 And<>thou<>shalt<>28 760.0523 212 13476 5017 1627 244 215 1024 according<>to<>the<>29 745.5014 323 799 30166 89427 734 341 3830 in<>the<>midst<>30 740.8262 267 20407 89427 384 6985 268 383 And<>the<>LORD<>31 721.3574 408 13476 89427 6653 2249 409 5964 the<>hand<>of<>32 706.9379 406 89427 1935 48646 453 25368 449 the<>king<>of<>33 683.5401 427 89427 2695 48646 1662 25368 851 in<>the<>land<>34 675.1543 325 20407 89427 
1885 6985 381 1297 all<>the<>people<>35 661.7601 287 8710 89427 2470 2661 341 1266 the<>word<>of<>36 659.9338 331 89427 952 48646 448 25368 375 I<>pray<>thee<>37 657.1515 162 23430 356 3926 224 409 172 I<>could<>not<>38 656.3701 229 23430 1969 10847 782 1434 609 and<>said<>unto<>39 655.6299 314 59758 6088 8944 1043 641 1643 of<>the<>house<>40 638.2263 393 48646 89427 2368 14731 463 1325 the<>LORD<>thy<>41 637.2239 296 89427 6653 4529 5964 319 304 the<>name<>of<>42 630.4330 329 89427 1105 48646 350 25368 340 before<>the<>LORD<>43 625.5478 265 2676 89427 6653 742 265 5964 of<>the<>children<>44 613.8381 364 48646 89427 1936 14731 401 1346 I<>don<>t<>45 604.2793 120 23430 219 667 120 256 219 come<>to<>pass<>46 601.0908 164 2653 30166 906 523 164 719 and<>thou<>shalt<>47 582.1768 206 59758 5017 1627 344 225 1024 to<>pass<>when<>48 576.4219 167 30166 906 4141 719 227 167 of<>the<>earth<>49 574.9629 308 48646 89427 1142 14731 314 910 answered<>and<>said<>50 568.6228 180 602 59758 6088 190 180 1043
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can see "the _ of" trigrams topping the list of the log likelihood AM output. This trend continues through rank 1777 (trigram #2810) and is due to the high frequency of 'the' and 'of'. Since Poisson's AM considers only the ratio of the joint frequency to the value expected for it if the 3 words were independent, this trend is not necessarily reflected in its output. We observe that Poisson's AM has found better 3-word collocations, e.g. land of Egypt, children of Egypt, lord of hosts. We also observe that the top 50 frequency rankings and the Poisson's AM rankings are somewhat similar.

(b) Cutoff Point
- - - - - - -

There does exist a cutoff point in the LL3 output, as pointed out earlier. The cutoff lies at rank 1777, before which the trigrams are of the kind "the _ of" and after which the other trigrams occur. As said earlier, this is due to the high frequency of 'the' and 'of'.
We could say that LL3 is influenced by the individual frequencies of the words in the trigrams.

As for the output of Poisson's AM, I cannot find any interesting patterns. We do not claim that there are no patterns in the output, but at least they are not readily discernible to the eye unaided by statistics or mining tools. I looked for clustering of ranks and clustering of trigrams by word type, but could not find anything interesting. We could say that the individual frequencies (like the high frequencies of 'the' and 'of') do not affect the Poisson output as much as they affect the LL3 output.

(c) Rank Comparison
- - - - - - - -

The rank comparison of LL3 and Poisson's AM for trigrams was done as follows:

% rank.pl test3.ll3 test3.u3
Rank correlation coefficient = -0.1813

The negative value suggests a reverse relation; however, since the value is close to zero, we can say that the two rankings are more or less unrelated. In part (c) of Experiment 1 the log likelihood ranking was very similar to that of Poisson's AM (0.9773), but the trigram results do not confirm this. This is probably because Poisson's test considers only the joint frequency of the trigram. Still, its results are better than those of log-likelihood, at least in this case (and in the several others I tried).

(d) Overall Recommendation
- - - - - - - - - - - -

From the results of my experiments, I would say that the 3-word collocations found by Poisson's AM are more significant than those found by the log-likelihood AM. However, we do not have enough data or theory to support this statement.
Our conclusions regarding the best measure for finding collocations are further limited by the fact that we have considered only two of the innumerable AMs.

++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++

Natural Language Processing CS8761
Bridget Thomson McInnes
10 October 2002

EXPERIMENT 1:
--------------------------------------------------------------------------------------------------------------------

This experiment required the implementation of two modules to be used in conjunction with the N-gram Statistic Package. The two modules that were implemented calculate the True Mutual Information and the Symmetric Lambda measure. The modules were named tmi.pm and user2.pm respectively.

True Mutual Information calculates how much one random variable tells us about another. The formula that was used is:

   I(X;Y) = Sum( P(x,y) * lg( P(x,y) / P(x)P(y) ) ) = H(X) + H(Y) - H(X,Y).

This states that "the information about X tells about Y is the uncertainty in X plus the uncertainty about Y minus the uncertainty in both X and Y" [1]. This formula can be re-written to correspond to the information of a contingency table. Given a contingency table:

               word2    !word2
            ______________________
           |        |        |
    word1  |  n11   |  n12   |  n1p
           |        |        |
            ----------------------
           |        |        |
   !word1  |  n21   |  n22   |  n2p
           |        |        |
            ----------------------
              np1      np2   |  npp

The formula is:

   I(X;Y) = Sum( nij/npp * lg( nij/eij ) )

Where nij is the observed frequency of a specific cell and eij is the expected value for the corresponding cell, eij = (nip * npj) / npp.
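
The contingency-table form of the formula can be sketched as follows. This is a minimal illustration, not the code of tmi.pm; the function name `tmi` is hypothetical, the arguments follow the column order of the score files (n11, n1p, np1) plus the total npp, and empty cells are assumed to contribute nothing to the sum.

```python
import math


def tmi(n11, n1p, np1, npp):
    """True mutual information of a bigram from its 2x2 contingency table."""
    # Remaining marginals and cells follow from the three given counts.
    n2p = npp - n1p
    np2 = npp - np1
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11

    cells = [
        (n11, n1p, np1),
        (n12, n1p, np2),
        (n21, n2p, np1),
        (n22, n2p, np2),
    ]
    total = 0.0
    for nij, row_total, col_total in cells:
        if nij > 0:  # treat 0 * lg(0) as 0
            eij = row_total * col_total / npp
            total += (nij / npp) * math.log2(nij / eij)
    return total


# Row "the<>LORD<>1 0.0152 5964 89427 6653" from the TMI output listed
# earlier in this file (total bigram count 1354222).
score = tmi(5964, 89427, 6653, 1354222)
```

With the counts of the row the<>LORD from the TMI listing earlier in this file, the sketch should land close to the reported score of 0.0152, and a table whose cells all equal their expected values gives zero.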
The calculation of True Mutual Information for the contingency table gives the information word1 tells us about word2: the uncertainty in word1 plus the uncertainty in word2 minus the uncertainty in both word1 and word2.

Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage reduction in errors in predicting the dependent given knowledge of the independent" [3]. The formula used to calculate lambda is:

   lambda = ( rF + cF - Fr - Fc ) / ( (2 * N) - Fr - Fc ), where

   rF = Sum of the maximum frequency in each row
   cF = Sum of the maximum frequency in each column
   Fr = maximum marginal row value
   Fc = maximum marginal column value
   N  = total bigram count

Lambda is "the percent one reduces errors in guessing the value of the dependent variable when one knows the value of the independent variable. Specifically, lambda is the surplus of errors made when the marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent variable are known for each value of the independent variable" [3]. "A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own frequency distribution" [3]. Conversely, a lambda of 1 indicates that knowing the independent variable eliminates all error in predicting the dependent variable.
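
The lambda formula can be sketched for a bigram's 2x2 table as follows. This is an illustration under the definitions given above, not the code of user2.pm; the function name `symmetric_lambda` is hypothetical, and returning 0 when the denominator is zero is an assumption made here to avoid a division error.

```python
def symmetric_lambda(n11, n1p, np1, npp):
    """Goodman-Kruskal symmetric lambda for a 2x2 bigram table."""
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11

    rF = max(n11, n12) + max(n21, n22)  # sum of row maxima
    cF = max(n11, n21) + max(n12, n22)  # sum of column maxima
    Fr = max(n1p, npp - n1p)            # largest row marginal
    Fc = max(np1, npp - np1)            # largest column marginal

    denominator = 2 * npp - Fr - Fc
    if denominator == 0:  # degenerate table (assumed convention)
        return 0.0
    return (rF + cF - Fr - Fc) / denominator


# Two words that only ever occur together (each count is 2).
rare_pair = symmetric_lambda(2, 2, 2, 357114)
# Hypothetical frequent words that co-occur only at chance level.
chance_pair = symmetric_lambda(10, 100, 100, 10000)
```

A pair of words that occur only with each other scores 1, while words that mostly occur apart score 0, which is why low-frequency bigrams crowd the top of the lambda ranking.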
PLEASE NOTE: the experiment was run using the complete works of Shakespeare found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare

TOP 50 COMPARISONS:
--------------------------------------------------------------------------------------------------------------------

Which measure seems better at identifying significant or interesting collocations?

The tmi module and the user2 module, which I will call the lambda module from here on, returned extremely different results. There is not a single overlap between the top 50 bigrams of each module's output: combining the top 50 bigrams of each module gives 100 unique bigrams.

The tmi module returned quite a few Proper Noun collocations, for example:

   KING<>RICHARD<>
   MARK<>ANTONY<>
   HENRY<>VI<>
   PRINCE<>HENRY<>
   DOMITIUS<>ENOBARBUS<>
   QUEEN<>MARGARET<>
   DON<>PEDRO<>

I found this to be very good. I believe those are collocations that are desired. The lambda module did not return any Proper Noun collocations.

The tmi module's output returned very few formal collocations, for example:

   to<>be<>
   my<>lord<>

These are good collocations because they represent the type of text that was used. The text that was used for this experiment was the complete works of Shakespeare. The words "my lord" and "to be" tend to occur together often throughout Shakespeare's writings. But there were only a few collocations of that type.

In the tmi module's output, 29 of the top 50 bigrams contained function words or punctuation, while the lambda module returned none. These are bigrams with an expected high frequency. Although function word collocations are not 'interesting' collocations, it is interesting to note that the lambda module did not return any. Lambda indicates the percentage of error reduction in guessing the value of the dependent variable when knowing the independent variable.
The lambda module returned no function word bigrams because when a bigram has a lower frequency value the lambda moves closer to one, while when a bigram has a higher frequency value the lambda moves closer to zero. Because of this, the lambda module's output did not return any interesting bigrams in the top 50 ranks.

In the last 50 bigrams of the lambda module, quite a few interesting collocations appeared. For example:

   rich<>hangings<>
   blessed<>wood<>
   sweet<>Exeter<>
   rude<>hands<>
   secretly<>open<>
   desert<>country<>
   widow<>sister<>

These collocations were very interesting. Although no Proper Name collocations appeared in the first 50 bigrams of the lambda module's output, quite a few 'true' collocations appeared in its last 50. More true collocations appeared in the last 50 bigrams of the lambda output than appeared in the first 50 bigrams of the tmi module's output.

How would you characterize the top 50 bigrams found by each module?

I would characterize the tmi module's top 50 bigrams as consisting mostly of Proper Noun collocations and function word collocations. It was very good at returning Proper Noun collocations, which is interesting. I would characterize the lambda module's top 50 bigrams as very poor, but I would characterize its last 50 bigrams as consisting mostly of interesting, 'true' bigrams: bigrams that would be considered significant.

Is one of these measures significantly 'better' or 'worse' than the other? Why?

Determining whether the True Mutual Information measure is better or worse than the Lambda Symmetric measure is difficult. The True Mutual Information measure returned many Proper Noun collocations, which are considered very significant collocations. Proper Noun collocations are collocations that a measure should be able to distinguish. But the Lambda Symmetric measure returned collocations that were much more interesting, for example "desert country". This describes something about the text that is being analyzed.
Because of this I would have to say that the Lambda Symmetric measure performs 'better' than the True Mutual Information measure. This opinion is based on comparing the top 50 bigrams in the tmi module's output to the last 50 bigrams in the Lambda Symmetric module's output.

CUTOFF POINT:
--------------------------------------------------------------------------------------------------------------------

Are there 'natural' cutoff points for scores, where bigrams above this value appear to be interesting or significant while those below do not?

The lambda module's output values consist mostly of one and zero. At line 807 of the lambda output file the value becomes less than one. At line 23185 the value goes to zero. This means that 22378 out of 357114 bigrams have a value that is not equal to one or zero. The group of bigrams with a score of zero contains a lot of interesting bigrams, for example:

   weary<>bones<>
   country<>wench<>

The group of bigrams with a score of one, on the whole, does not contain interesting bigrams, for example:

   UNIMPROVED<>unreproved<>
   noster<>Henricus<>

However, while collecting examples of uninteresting bigrams with a score of one I found:

   helter<>skelter<>1 1.0000 2 2 2

Which, of course, is very interesting. I believe this shows that significant bigrams are not confined to bigrams with a lambda score of zero, but that they seem to occur more often when the score is zero. As the lambda score grows closer to one the number of significant bigrams seems to decrease and likewise, as the lambda score moves closer to zero the number of significant bigrams increases.

The highest score for True Mutual Information is .0083 while the most common score greater than zero is .0001. Significant collocations seem to be dispersed throughout the table.
For example:

   cunning<>hounds<>41 0.0000 2 155 66
   chamber<>window<>40 0.0001 24 287 92

But Proper Name collocations have a score of .0005 and higher, which is a significant cutoff. The majority of the scores fall below .0002. From line 1345 on, the scores have a value of .0001 or lower. This means that 355799 out of 357144 bigrams score .0001 or less. Any collocation with a score of .0001 or lower is going to be lost.

RANK COMPARISON:
--------------------------------------------------------------------------------------------------------------------

Comparison between ll.pm and tmi.pm.

   %csdev0540% perl rank.pl ll.txt tmi.txt
   %Rank correlation coefficient = -0.9268

Since the Rank correlation coefficient is close to -1 we can say that the Loglikelihood Ratio and True Mutual Information are almost exactly opposite of each other.

Comparison between ll.pm and user2.pm.

   %csdev0542% perl rank.pl ll.txt user2.txt
   %Rank correlation coefficient = -0.6381

Again the Rank correlation coefficient is closer to -1 than 0. Therefore, we can say that the Loglikelihood Ratio and the Symmetric Lambda measure are opposite of each other. We should notice though that the Symmetric Lambda measure is closer to the Loglikelihood Ratio than True Mutual Information is.

Comparison between mi.pm and tmi.pm.

   %csdev0543% perl rank.pl mi.txt tmi.txt
   %Rank correlation coefficient = -0.9273

The Rank correlation coefficient is close to -1. Therefore, we can say that Pointwise Mutual Information and True Mutual Information are opposite of each other. They are 'more opposite' of each other than the Loglikelihood Ratio and True Mutual Information.

Comparison between mi.pm and user2.pm.

   %csdev0544% perl rank.pl mi.txt user2.txt
   %Rank correlation coefficient = -0.6364

The Rank correlation coefficient is closer to -1 than 0. Therefore, we can say that Pointwise Mutual Information and the Symmetric Lambda measure are opposite of each other.
We should notice though that the Symmetric Lambda measure is closer to Pointwise Mutual Information than True Mutual Information is.

OVERALL RECOMMENDATION:
--------------------------------------------------------------------------------------------------------------------

Which is the best measure for identifying significant collocations in large corpora: mi.pm, ll.pm, user2.pm or tmi.pm?

I think that the best measure for identifying significant collocations in large corpora is the ll.pm module. The ll.pm module seems to rank interesting collocations higher. For example, here are a few randomly selected collocations and their line numbers in each output file.

Proper Noun Collocations

   QUEEN<>GERTRUDE<>
      ll.pl    - 162
      mi.pl    - 23638
      tmi.pl   - 147
      user2.pl - 2147

   KING<>EDWARD<>
      ll.pl    - 147
      mi.pl    - 50771
      tmi.pl   - 174
      user2.pl - 5233

   MARK<>ANTONY<>
      ll.pl    - 16
      mi.pl    - 19732
      tmi.pl   - 18
      user2.pl - 826

Significant Collocations

   your<>highness<>
      ll.pl    - 183
      mi.pl    - 69966
      tmi.pl   - 204
      user2.pl - 5943

   pray<>you<>
      ll.pl    - 86
      mi.pl    - 119610
      tmi.pl   - 90
      user2.pl - 290548

   Sir<>John<>
      ll.pl    - 89
      mi.pl    - 34664
      tmi.pl   - 94
      user2.pl - 5162

   my<>heart<>
      ll.pl    - 114
      mi.pl    - 143689
      tmi.pl   - 111
      user2.pl - 32081

We can see from the above examples that the ll.pm module places interesting collocations closer to the top of its output. I chose to look at the line number to justify why I picked Loglikelihood because there are the same number of bigrams in each file, so the line number gives us an idea of where a bigram falls in each measure's ranking. It must be noted that the user2.pl module ranks better collocations closer to the bottom than the top. Yet, even keeping this in mind, the ll.pm module identifies the significant collocations better.

EXPERIMENT 2:
--------------------------------------------------------------------------------------------------------------------

This experiment required the implementation of two modules to be used in conjunction with the N-gram Statistic Package using trigrams.
The two modules that were implemented calculate the Loglikelihood Coefficient and the Symmetric Lambda measure. The modules were named ll3.pm and user3.pm respectively.

The Log Likelihood coefficient determines how much more likely one hypothesis is than another. The formula that is used is:

   G^2 = Sum( (2 * nij) * log( nij / eij) )

This states how much more likely the observed values are than the expected values of the trigram. The formula above corresponds to the following contingency table:

                |    w3    |   !w3    |
            ____|__________|__________|
            |   |          |          |
            | w2|   n11    |   n12    | n1p
        w1  |___|__________|__________|___
            |   |          |          |
            |!w2|   n21    |   n22    | n2p
        ____|___|__________|__________|___
            |   |          |          |
            | w2|   n31    |   n32    | n3p
       !w1  |___|__________|__________|___
            |   |          |          |
            |!w2|   n41    |   n42    | n4p
            |___|__________|__________|_____
                    np1    |   np2    | npp

Where nij is the observed frequency of a specific cell and eij is the expected value for the corresponding cell. The calculation of the Loglikelihood Coefficient for the contingency table compares the observed probability of the words with the expected probability of the words.

Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage reduction in errors in predicting the dependent given knowledge of the independent" [3]. The formula used to calculate lambda is:

   lambda = ( rF + cF - Fr - Fc ) / ( (2 * N) - Fr - Fc ), where

   rF = Sum of the maximum frequency in each row
   cF = Sum of the maximum frequency in each column
   Fr = maximum marginal row value
   Fc = maximum marginal column value
   N  = total trigram count

Lambda is "the percent one reduces errors in guessing the value of the dependent variable when one knows the value of the independent variable.
Specifically, lambda is the surplus of errors made when the marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent variable are known for each value of the independent variable" [3]. "A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own frequency distribution" [3]. Therefore a lambda of 1 indicates that knowing the distribution of the independent variable helps estimate the value of the dependent variable. PLEASE NOTE: the experiment was run using the complete works of Shakespeare found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare TOP 50 COMPARISONS: -------------------------------------------------------------------------------------------------------------------- Which measure seems better at identifying significant or interesting collocations? The ll3.pm module and the user3.pm module, which I will call the loglikelihood module and the lambda module respectively from here on out, returned extremely different results when comparing them. There is not a single overlap when comparing the top 50 trigrams of each modules output. Meaning combining the top 50 trigrams of each module there are 100 unique trigrams. The loglikelihood module returned trigrams that are similar to the below examples: the<>Duke<>of<> 2 20253.8357 178 51427 502 33153 188 7915 424 the<>name<>of<> 3 19756.8674 121 51427 1383 33153 153 7915 176 the<>Earl<>of<> 4 19304.0744 46 51427 166 33153 46 7915 156 the<>house<>of<>5 19223.6022 72 51427 1159 33153 202 7915 92 49 out of the top 50 trigrams consist of similar trigrams. The only trigram of the 50 that did not look similar to the examples above was: ",<>my<>lord<>1 26939.4642 1938 181185 22593 4425 4400 2068 2532". This trigram was not very interesting either. 
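The symmetric lambda formula given earlier can be sketched in a few lines. This is only an illustration, not the user3.pm code: the function name and the example table are hypothetical, and the second subtracted term in the numerator is read as Fc (the maximum column marginal), which is the standard form of the measure.

```python
def symmetric_lambda(table):
    """Symmetric Goodman-Kruskal lambda for a table of non-negative counts.

    `table` is a list of rows, e.g. the 4x2 trigram table with rows
    (w1 w2), (w1 !w2), (!w1 w2), (!w1 !w2) and columns w3, !w3.
    """
    columns = list(zip(*table))
    n = sum(sum(row) for row in table)

    rF = sum(max(row) for row in table)      # sum of the row maxima
    cF = sum(max(col) for col in columns)    # sum of the column maxima
    Fr = max(sum(row) for row in table)      # largest row marginal
    Fc = max(sum(col) for col in columns)    # largest column marginal

    denominator = 2 * n - Fr - Fc
    if denominator == 0:
        return 0.0  # degenerate table: all counts sit in one cell
    return (rF + cF - Fr - Fc) / denominator


# Hypothetical 4x2 table, for illustration only.
example = [[3, 1], [2, 4], [1, 6], [5, 78]]
print(symmetric_lambda(example))
```

For any table of non-negative counts, rF is at least Fc and cF is at least Fr, so this quantity always lies between 0 and 1. A negative score therefore suggests checking how the cell counts are derived from the marginal counts rather than taking an absolute value.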
Lambda indicates the percentage of error reduction in guessing the value of the dependent variable when knowing the independent variable. In practice, when a collocation has a lower frequency value the lambda moves closer to one, while when a collocation has a higher frequency value the lambda moves closer to zero. Because of this, the lambda module's output did not return any interesting collocations in the top 50 ranks. In the last 50 ranks of the lambda module's bigram output, quite a few interesting collocations appeared. I expected this to be true for trigrams also but it was not. The last 50 trigrams contained punctuation. For example:

    FORMAL<>regular<>,<>
    Throca<>movousus<>,<>

Even with punctuation, though, there were interesting trigrams. For example:

    rigorously<>effused<>,<>
    Perfect<>chrysolite<>,<>
    affectedly<>Enswathed<>,<>

But as seen, the trigrams end with a punctuation mark.

How would you characterize the top 50 trigrams found by each module?

I would characterize the loglikelihood module's top 50 trigrams as consisting mostly of uninteresting trigrams. The trigrams all consist of "the of". I would characterize the lambda module's top 50 trigrams as very poor also.

Is one of these measures significantly 'better' or 'worse' than the other? Why?

I don't think one of the modules is better or worse than the other. Neither returns outstanding collocations. Even taking into consideration that the lambda module ranks better collocations closer to the bottom than the top, the collocations are not noteworthy.

CUTOFF POINT:
--------------------------------------------------------------------------------------------------------------------
Are there 'natural' cutoff points for scores, where trigrams above this value appear to be interesting or significant while those below do not?

The loglikelihood module's trigrams that consist of "the of" end at rank 586. They then continue on with "I not" trigrams, then move to "The of" trigrams. Not very interesting.
The trigrams that follow show a similar pattern. The trigrams come in blocks, each block looking similar. For example, the "I you" block and the ", ." block. Some trigrams seem to look like small sentences more than collocations. For example, "YOU<>LIKE<>IT<>". Other trigrams look like prepositional phrases. For example, "more<>mirth<>than<>" and "more<>faithful<>than<>", which may be interesting but I don't believe that they are collocations. This may have to do with the data that I am using. Most sentences in Shakespeare's works are short.

The lambda module's trigrams do not have a natural cutoff either. The top 100 trigrams were not very good collocations. For example:

    ,<>these<>encounterers<>
    mischievous<>UNHATCHED<>undisclosed<>

The last 100 trigrams were not very good either, which is not what I expected. They consisted of punctuation. For example:

    .<>CXLV<>.<>
    ,<>Tombless<>,

Neither useful nor interesting.

One other point to make about the lambda module. For trigrams, I came up with negative values. This measure is supposed to be usable with trigram data. I am not certain why the negative values showed up. There does not seem to be anything wrong with the implementation. I did not get negative values until I ran my program with a very large corpus. The program worked perfectly on smaller data sets. The results stayed between 1 and -1. The trigrams that received a score of -1 were similar to the trigrams that received a score of 1. Both sets contained trigrams with punctuation. I did not read in any of the literature that the absolute value of the score must be taken, but this is one possibility that I looked into.

OVERALL RECOMMENDATION:
--------------------------------------------------------------------------------------------------------------------
Which is the best measure for identifying significant collocations in large corpora: ll3.pm or user3.pm?

In my opinion, ll3.pm is the best measure to use.
Neither measure identified interesting or significant collocations. This may have to do with the data that I was experimenting with but I did not see any grouping of good collocations. They seemed to be dispersed throughout the tables. I believe that ll3.pm is the better measure due to the negative numbers I was getting for the user3.pm module. SOURCES: ------------------------------------------------------------------------------------------------------------------- [1] www.engineering.usu.edu/classes/ece/7680/lecture2/node3.html [2] www.zi.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap28/sect20.htm [3] ww2.chass.ncsu.edu/garson/pa765/assocnominal.htm ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Suchitra Goopy 10/11/2002 Natural Language Processing ------------------------------------------------------------------------------ Assignment 2 The Great Collocation Hunt ------------------------------------------------------------------------------ INTRODUCTION: ------------ "A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things" -Foundations of Statistical Natural Language Processing (Christopher D.Manning and Hinrich Schutze) This experiment was performed with the aim of finding a test of association that can be implemented both for bigrams and trigrams.The nsp package was implemented by "Satanjeev Banerjee" and has many interesting measures of association for bigrams.Though nsp has been extended to accomodate ngrams, a suitable measure of association(for nograms) is yet to be implemented. 
The test of association that I have used in this assignment is "Goodman and Kruskal's Gamma". This test was taken from "Handbook of Parametric and NonParametric Statistical Procedures" by David J. Sheskin. A test of association is usually performed to see whether the variables involved in the test are dependent on each other or independent. Many tests can also be used to determine how strong or how weak the association between the two variables is.

Though I learnt that "Goodman and Kruskal's Gamma" can be extended to accommodate trigrams, I have not found any supporting documents or references for this. But I have extended it to work for trigrams as well.

"The Goodman and Kruskal's Gamma is a signed test that measures both the strength and direction of the relationship,the sign indicating whether a relationship is positive or negative"
- Taken from a website, the url is given below
http://216.239.53.100/search?q=cache:VBkDxI8tR70C:www.bus.ed.ac.uk/cfm/cfmr988.pdf+extending+goodman+and+kruskal%27s+gamma++for+3+variables&hl=en&ie=UTF-8

EXPERIMENT 1:
-------------
The two main modules that had to be implemented for performing Experiment 1 were tmi.pm and user2.pm. The module "tmi.pm" was implemented to calculate "True Mutual Information", while user2.pm was implemented for bigrams using the Goodman and Kruskal's Gamma.
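For a 2x2 bigram table, Goodman and Kruskal's Gamma reduces to a short expression (it coincides with Yule's Q). The sketch below is an illustration under that assumption and is not the user2.pm code. The function name is hypothetical, the arguments follow the usual NSP count order (n11, n1p, np1, plus the total npp), and the total bigram count used in the example is made up.

```python
def gamma_2x2(n11, n1p, np1, npp):
    """Goodman-Kruskal gamma for a 2x2 bigram contingency table.

    n11: times the bigram occurs; n1p: times word 1 is first;
    np1: times word 2 is second; npp: total number of bigrams.
    """
    n12 = n1p - n11              # word 1 followed by something else
    n21 = np1 - n11              # word 2 preceded by something else
    n22 = npp - n1p - np1 + n11  # neither word in its position

    concordant = n11 * n22
    discordant = n12 * n21
    if concordant + discordant == 0:
        return 0.0  # no concordant or discordant pairs: gamma is undefined
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical corpus size of 100000 bigrams, for illustration only.
print(gamma_2x2(1, 1, 385, 100000))      # first word never occurs elsewhere -> 1.0
print(gamma_2x2(10, 100, 100, 1000))     # counts match independence -> 0.0
```

One consequence of this form is that any bigram whose first word never appears in another bigram (n12 = 0) scores exactly 1 regardless of how rare it is, so many single-occurrence bigrams can tie at the top of the ranking.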
TOP 50 COMPARISONS ------------------ First a comparison was made between the outputs of tmi.pm and user2.pm I found that user2 had more interesting collocations than tmi.Though there were a number of collocations in the top 50 of user2 that did not make much sense,a few of the interesting collocations I found were : picked<>up<>1 1.0000 1 1 385 stark<>naked<>1 1.0000 2 2 15 complied<>with<>1 1.0000 1 1 1071 split<>open<>5 0.9996 1 2 32 laboured<>hard<>15 0.9986 1 3 51 banished<>from<>32 0.9969 2 3 443 The above words usually tend to occur together and were also highly ranked by user2.If the value is closer to 1 then it indicates a very strong positive relationship.As we went down the ranks some of the following collocations I found were: shore<>would<>6397 0.0395 1 269 482 water<>a<>6842 -0.0971 2 148 2296 cannibal<>coast<>19 0.9982 1 4 43 enough<>in<>6940 -0.1348 1 98 1871 These were not good collocations and were ranked low by user2.Also some of them have a negative sign that indicates that the words have a negative relationship with each other. Tmi on the other hand had ranked the following to be very high.Though "of the" tends to occur together very often, or rather most of the time,it does not really form a collocation,we do not have enough evidence to say it forms a collocation.(It does not make much sense by itself) of<>the<>4 0.0086 817 3558 5854 Some of the good collocations in tmi were ranked down the list.Like the ones given below: cried<>out<>48 0.0007 13 18 423 cast<>away<>49 0.0006 12 32 147 rainy<>season<>50 0.0005 7 14 31 I feel that "dice.pm" would have performed better than the above two tests. This is because it has more information when it ranks the bigrams and hence does a better job. 
CUTOFF POINT
-----------
On looking at the output for tmi I did not find any cut-off point above which the collocations were interesting. There were some vague and funny collocations even in the first few ranks. Sometimes I found some interesting collocations towards the middle, say around the 50th or 60th ranks.

The output for user2 had good collocations in the first few ranks. Though I looked for a cut-off point I was not able to really find one. It was true that as the ranks decreased, the interesting collocations decreased as well, but just as I thought that I had found a cut-off point I would see a few good collocations, and would have to rethink the cut-off point.

RANK COMPARISON
---------------
I used rank.pl to compare the output produced by ll.pm and tmi.pm

    rank.pl ll.out tmi.out

I got a rank correlation coefficient of -0.0338. A value this close to zero indicates that the two rankings are essentially unrelated; a value near -1 would be needed to say that they are reversed.

I compared ll.pm and user2.

    rank.pl ll.out user2.out      0.7117

This shows that they are ranked in broadly the same way.

    rank.pl mi.out tmi.out        -0.1136

A comparison between mi and tmi shows only a weak negative relationship between their rankings.

    mi.out user2.out              0.9229

This shows that mi and user2 are ranked in almost the same way.

Each measure is different in that it captures different elements. The tests of association vary in what they consider when performing their calculations. For example, if a measure uses raw frequency values in its test, then bigrams containing prepositions, which occur more frequently, will be ranked higher, but that does not mean that they form collocations.
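To make the interpretation of these coefficients concrete, here is a minimal sketch of a Spearman-style rank correlation. It assumes that the coefficient reported by rank.pl is of this kind and that there are no tied ranks; the function name and the toy rankings are hypothetical.

```python
def spearman(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties.

    Each argument maps an n-gram string to its rank (1 = best).
    Only n-grams present in both rankings are compared.
    """
    shared = ranks_a.keys() & ranks_b.keys()
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared n-grams")
    d_squared = sum((ranks_a[k] - ranks_b[k]) ** 2 for k in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy rankings, for illustration only.
first = {"a<>b<>": 1, "c<>d<>": 2, "e<>f<>": 3, "g<>h<>": 4}
same = dict(first)
reverse = {"a<>b<>": 4, "c<>d<>": 3, "e<>f<>": 2, "g<>h<>": 1}

print(spearman(first, same))     # identical rankings -> 1.0
print(spearman(first, reverse))  # fully reversed rankings -> -1.0
```

On this scale, 1 means the two measures order the n-grams identically, -1 means one ordering is the exact reverse of the other, and values near 0 mean the orderings have little to do with each other.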
OVERALL RECOMMENDATION
----------------------
On comparing the 4 modules I found that mi.pm had very good results. Many collocations were interesting, like the ones below:

    pro<>cessing<>1 17.0977 1 1 1
    tax<>deductible<>2 16.0977 1 2 1
    United<>States<>2 16.0977 2 2 2

After mi.pm I would think that user2.pm performed very well. This is because some good collocations were ranked very high, and as the rank decreased most of the bigrams did not make much sense.

Then tmi.pm performed well compared to ll.pm. It seemed to give higher ranks to better collocations. I came to this conclusion because ll.pm did not seem to do the ranking as effectively as tmi.pm. For instance, here is an example where the same bigram was ranked differently. I noticed this trend for a number of them:

    ll.pm  : cast<>away<>335 123.3642 12 32 147
    tmi.pm : cast<>away<>49 0.0006 12 32 147

----------------------------------------------------------------------------
EXPERIMENT 2:
------------
The two main modules that had to be implemented for experiment 2 were ll3.pm and user3.pm. ll3.pm was the Log Likelihood Ratio for trigrams. user3.pm was the same test of association used for bigrams, but had to be extended for trigrams.

TOP 50 COMPARISONS
------------------
When I compared the output produced by ll3.pm and user3.pm, I noticed that most of the trigrams (at least in ll3) had at least one or two punctuation marks like , or ; . Then I tried to eliminate these trigrams by using the --stop*** option provided in nsp. But this option removes a trigram only if it is made up entirely of stop words from the STOP file. This did pose some problems when trying to find and analyse the interesting trigrams. Probably, if I had been able to eliminate the punctuation in the trigrams, more meaningful collocations could have been formed.
user3.pm provided more interesting trigrams than ll3.pm. As said above, ll3.pm had a number of trigrams with punctuation marks in the first few hundred ranks. The following were some of the interesting trigrams that I noticed in the output produced by user3.pm:

    services<>for<>which<>1 1.0000 1 1 1320 889 1 1 8
    Thank<>God<>!<>1 1.0000 1 1 159 92 1 1 3
    uneven<>state<>of<>1 1.0000 1 1 22 3558 1 1 13
    circumstance<>presented<>itself<>13 0.9988 1 3 12 31 1 1 2
    Please<>note<>:<>50 0.9951 1 2 6 337 1 1 1

ll3.pm on the other hand had a lot of punctuation marks in the trigrams, and most of the time the trigram wasn't a good collocation. Though some words tend to occur together as they did, they do not necessarily form collocations.

CUTOFF POINT
-----------
I did not find any interesting cut-off point above which there were unusual or good collocations. There were a number of good collocations in the beginning, and later too you could find some collocations that were really good, sometimes better than the higher ranked ones. I found this to be true for both ll3.pm and user3.pm.

RANK COMPARISON
---------------
I used rank.pl to compare ll3.pm and user3.pm. The rank correlation coefficient I obtained was

    ll3.out user3.out    -0.2018

This shows that the two rankings are weakly negatively related; they are not completely reversed, which would give a value near -1.

The two tests differ in that they appear to capture different elements and rank them accordingly. Log likelihood is a very popular measure when compared to "Goodman and Kruskal's Gamma". But in many ways I felt that the performance of user3 was much better.

OVERALL RECOMMENDATION:
-----------------------
Though I have not found proof to support the test of association in user3.pm, I would think that it is working fairly well. I found some trigrams that formed very good collocations. Also I feel that if there was a mechanism to eliminate the punctuation marks from occurring in the trigrams, a better analysis of the experiment could have been carried out.
CONCLUSION ---------- I did notice one thing in the experiments.A strong association between the words does not necessarily mean they form a collocation.They could have a high ranking because they probably appeared a number of times in the text and hence were put together,but they would definitely not form a collocation. This experiment also helped me to perform a lot of research in trying to find a test of association.I also spent a considerable time thinking about the trigrams and bigrams and trying to see if they formed meaningful collocations. ***The stop option was provided by creating a file STOP.txt This File had ; , etc each on a different line Then the --stop option was used on the command line when executing the program++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Paul Gordon CS8761 1913768 10/10/02 Assignment 2 -- The great Collocation Hunt Introduction: The great collocation hunt is an experiment in using statistical measures of association to rank and categorize collocations. The intention is to rank the most "interesting" collocations first, followed by more mundane, or less interesting collocations. In this experiment, four tests of association were used with bigrams, and the results compared with each other. Two of these tests, are part of the nsp package, mi.pm, which measures pointwise mutual information, and ll.pm, which measures the log-likelihood. Two of the measures were implemented by the experimenter, tmi.pm, which measures true mutual information, and user2.pm, which measures the odds ratio. 
Also, two tests of association were used with trigrams, ll3.pm, which measures log-likelihood, and user3.pm, which measures the odds ratio for trigrams. The Corpus used for the experiments was the combination of several Project Gutenberg files of the following texts: Crime and Punishment, Of Human Bondage, Moby Dick, and The complete King James Bible. BIGRAMS: user2.pm, tmi.pm, mi.pm and ll.pm TOP 50 COMPARISON The original implementation of True Mutual Information (tmi.pm) had only 49 ranks for a corpus of nearly 2 million tokens. This was a problem. Apparently TMI can become quite small with large corpora. As a result, I decided to multiply my result by a constant factor. This wouldn't affect the ordering, other than to make it a finer grained measure. The alternative, modifying the number of significant figures reported by statistic.pl was not an option. User2.pm, which measures the Odds Ratio, did well at finding proper nouns. It is not until number 21 with "categorical imperative", that the collocation is something other than a proper noun ( although "PUBLIC DOMAIN" is not really a proper noun ). The top ten result follow: Pulcheria<>Alexandrovna<>1 894532483.0000 123 123 123 Avdotya<>Romanovna<>2 836590755.0000 115 115 115 Moby<>Dick<>3 633790675.0000 87 87 87 Svidriga<>lov<>4 214698175.0000 207 210 207 Marfa<>Petrovna<>5 189534429.6667 78 78 79 Nikodim<>Fomitch<>6 177467563.0000 24 24 24 PROJECT<>GUTENBERG<>7 119519499.0000 16 16 16 El<>Greco<>8 97788843.0000 13 13 13 Praskovya<>Pavlovna<>9 68814523.0000 9 9 9 Sag<>Harbor<>10 61570923.0000 8 8 8 This particular measure scores high when the bigram parts are only associated with that bigram -- notice for the top three, the first and last words never occur as first and last words in any other bigrams. Just beyond "categorical imperative there are several interesting collocation, including "hoky poky", "Pell mell" and "web browser". 
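The plain odds ratio n11*n22 / (n12*n21) is undefined for the top three bigrams above, because n12 and n21 are both zero when neither word occurs in any other bigram. The scores in the top-ten table are consistent with adding 0.5 to every cell before taking the ratio. That is an inference from the listed numbers, not something stated in this report, and the total bigram count used below (1,810,917) is likewise inferred rather than given. The function name is hypothetical.

```python
def smoothed_odds_ratio(n11, n1p, np1, npp, correction=0.5):
    """Odds ratio for a 2x2 bigram table with a constant added to each cell.

    ASSUMPTION: the 0.5 correction is inferred from the listed scores;
    it is not documented as the method used in user2.pm.
    """
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    c = correction
    return ((n11 + c) * (n22 + c)) / ((n12 + c) * (n21 + c))


# ASSUMPTION: inferred total, consistent with "nearly 2 million tokens".
NPP = 1810917

print(smoothed_odds_ratio(123, 123, 123, NPP))  # Pulcheria<>Alexandrovna
print(smoothed_odds_ratio(78, 78, 79, NPP))     # Marfa<>Petrovna
print(smoothed_odds_ratio(207, 210, 207, NPP))  # Svidriga<>lov
```

Under these assumptions the three calls give 894532483.0, about 189534429.67 and 214698175.0, matching the listed scores for those rows. With such a correction, a bigram whose words never occur apart is divided by only 0.25, which is why names like these dominate the top of the list.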
In contrast, there are virtually no proper nouns in the top 50 of the true mutual information measure, Lord and Israel being the exceptions. The top ten results from the mutual information measure are shown below. It appears that in order to rank high, not only does the bigram have to appear often, but each individual word has to appear often as well. This makes sense considering that tmi is summed over all squares of the contingency table. Notice that the tmi is much larger than it should be. This is because of the multiplying factor mentioned above. To get the actual value, divide by 1000000000. ,<>and<>1 43006180.2526 32702 119732 59360 of<>the<>2 13431457.2323 15222 50674 94305 the<>LORD<>3 12450162.9418 5962 94305 6648 .<>He<>4 9075845.1419 4181 65625 5246 in<>the<>5 7285833.3391 7752 23359 94305 shall<>be<>6 6198847.1158 2526 10154 10124 ;<>and<>7 5242192.1121 4755 17562 59360 don<>t<>8 5106985.6083 972 973 2957 I<>will<>9 4941218.7702 2055 18788 4891 I<>am<>10 4360132.2218 1323 18788 1537 CUTOFF POINT There does not appear to be any good cutoff point for the true mutual information measure. But then, that is because so many of the top ranking collocations don't appear to be very interesting, just common. There are many connecting words in the collection. And there isn't much to distinguish the underlying corpus. There appears to be a cutoff area, rather than a cutoff point for the odds ratio. All bigrams whose constituents are only a component of that bigram, and which only appears once ( 1, 1, 1 ) are ranked 23. Between there and rank 40, the bigrams appear to go from interesting to unusual. At rank 40, one of the bigrams is "beneficent publicity". This strikes me more as an unusual phrase rather than a collocation. There are many of these by rank 40. This measure picks up bigrams whose components appear together a lot relative to the amount they appear within other bigrams. 
So at first the measure picks out interesting pairs, but further down the ranks they lose their interesting qualities.

RANK COMPARISON

    rank ll to user2 : 0.6858
    rank ll to tmi   : 0.9984
    rank mi to user2 : 0.9949
    rank mi to tmi   : 0.7049

True mutual information is more like the log-likelihood measure and user2 is more like pointwise mutual information. Log-likelihood and true mutual information probably have a high correlation because they have a similar method of summing logs over each square of the contingency table. It is less clear to me why there is such a high correlation between the pointwise mutual information and the odds ratio. I cannot see a relationship between the two fractions used in these measures. What is more, the pointwise mutual information calculation takes the log of the fraction. In general, though, neither measure sums over each square of the contingency table, and so they should be more like each other than they are like the other two measures.

OVERALL RECOMMENDATION

Based on the above experiments, it appears that the odds ratio does a much better job of giving interesting collocations a high rank. However, it also seems clear that the odds ratio misses interesting collocations with common words. As the ranking results suggest, the log-likelihood measure is very similar to that of true mutual information. In fact, many of the same bigrams appear in the top 50 of each. As such, though, it suffers a similar problem, which is that the "interesting" collocations are buried under the common ones. Pointwise mutual information is not as closely similar to the odds ratio as tmi is to the log-likelihood. It is, rather, similar in style. It also seems to rank highly bigrams whose components don't make up other bigrams, but it is not as extreme as the odds ratio. After examining the ranking each of these measures gave the bigrams in the corpus, I consider the pointwise mutual information to have the best list of interesting bigrams.
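The near-perfect rank agreement between log-likelihood and true mutual information reported above (0.9984) has a simple algebraic reason when both are computed from the same 2x2 table using the textbook definitions: G^2 = 2 * N * ln(2) * TMI, where TMI uses log base 2 and G^2 uses natural logs. The sketch below checks that identity numerically. It assumes those textbook definitions rather than the actual ll.pm and tmi.pm code, the function names are hypothetical, and the total of 2,000,000 bigrams is a made-up value.

```python
import math


def cell_counts(n11, n1p, np1, npp):
    """Observed and expected counts for the four cells of a 2x2 table."""
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    row_totals = [n1p, n1p, npp - n1p, npp - n1p]
    col_totals = [np1, npp - np1, np1, npp - np1]
    expected = [r * c / npp for r, c in zip(row_totals, col_totals)]
    return observed, expected


def log_likelihood(n11, n1p, np1, npp):
    """G^2 = 2 * sum(o * ln(o / e)), skipping empty cells."""
    observed, expected = cell_counts(n11, n1p, np1, npp)
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def true_mutual_information(n11, n1p, np1, npp):
    """TMI = sum((o / N) * log2(o / e)), skipping empty cells."""
    observed, expected = cell_counts(n11, n1p, np1, npp)
    return sum((o / npp) * math.log2(o / e)
               for o, e in zip(observed, expected) if o > 0)


# Counts for of<>the<> from the report; the total is hypothetical.
n11, n1p, np1, npp = 15222, 50674, 94305, 2_000_000
g2 = log_likelihood(n11, n1p, np1, npp)
tmi = true_mutual_information(n11, n1p, np1, npp)
print(math.isclose(g2, 2 * npp * math.log(2) * tmi))
```

Because the two scores differ only by a constant factor for a fixed corpus, they order bigrams identically; any small departure from a correlation of 1 would then come from rounding of the printed scores and the resulting ties.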
TRIGRAMS: user3.pm and ll3.pm

TOP 50 COMPARISON

As with the bigram form of the odds ratio, the trigram model picks up a lot of names. Now, however, it attaches whatever the preceding word is to the name. So nearly the whole top fifty consists of one of two names preceded by another word. Obviously this is not very useful.

The trigram log-likelihood is also much as it was in the bigram model. It ranks highly very common token combinations like "and the ,". The trigrams do, however, give more of a sense of the underlying text, especially the Bible's contribution. It turns out that the Bible was an unfortunate choice for inclusion, because of the chapter-verse markers, which were noticeable with the bigrams but are now very pronounced.

CUTOFF POINT

Again there did not appear to be any clear cutoff for the log-likelihood measure. Indeed, it seemed as though the quality of the trigrams improved with depth. But even at ranks above 4,000, they don't appear to me to be very interesting.

The odds ratio ranking has a somewhat strange cutoff. After about rank 500, the names start to drop off, and interesting collocations begin to appear. I suppose this is a result of this measure's tendency to rank proper nouns highly.

OVERALL RECOMMENDATION

Based on the results of both trigram statistics, my overall recommendation is for the odds ratio measure. It, in my estimation, has many more interesting trigrams. It would be especially good if there was a way to filter out the repetition of proper nouns. To me, the log likelihood ranks seem just as uninteresting at the beginning as further down. Actually, somewhere in the middle, there appear to be what I would call interesting trigrams.
For example, the following is taken from around rank 400,000:

    is<>life<>;<>404855 189.2558 1 10385 1067 17562 11 211 54
    these<>cases<>,<>404856 189.2550 3 1602 64 119732 4 229 23
    were<>consumed<>with<>404857 189.2540 1 5305 108 12113 7 118 11
    the<>letter<>suddenly<>404858 189.2532 1 94305 228 446 70 3 1
    eyes<>were<>somehow<>404859 189.2528 1 1203 5305 83 49 1 4
    while<>I<>live<>404860 189.2527 1 676 18788 441 37 2 43
    near<>here<>,<>404861 189.2510 2 350 883 119732 3 30 177
    be<>set<>for<>404862 189.2469 2 10124 912 12320 23 193 8
    justice<>take<>hold<>404863 189.245

"while I live" and "justice take hold" seem to me to be good candidates for "interesting" trigrams.

CONCLUDING THOUGHTS

Conducting the experiments for this assignment has gotten me to think more deeply about what makes a bigram a collocation. In one sense, a collocation is a bigram where the likelihood of the two words occurring together by chance is less than what is actually observed. That is, there is some statistical significance. But part of this assignment was to assess how well the measures did, which is a subjective criterion. In the Manning and Schutze chapter on collocations, they were interested in fairly common words such as "strong tea" and "powerful computers." Yet the results I got from the statistical tests seemed to rank high either very general word combinations like "is a" or fairly uncommon pairs like "El Greco."
++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Name: Prashant Jain CS8761 Assignment #2:The Great Collocation Hunt Introduction ------------- This assignment involved finding a suitable measure of association that can identify collocations-which are defined in the