++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ****************************************************** *CS8761 - Assignment 2 - The Great Collocation Hunt * *Author:Amine Abou-Rjeili * *Date: 10/09/2002 * *Filename: experiments.txt * ****************************************************** Objective --------- To investigate the effectiveness of different measures of association in identifying collocations and compare their performance. All measures are to be implemented as NSP (v0.51) modules. The True Mutual Information (tmi.pm) for bigrams and the Log Likelihood Ratio for trigrams were to be implemented and compared with an additional measure that can be used with both bigram and trigram collocations. After extensive research using online resources and library books, I decided to use the t-score association measure. This is a statistical test that tells us how probable or improbable it is that a certain event will occur (Manning and Schutze). It is defined as: t = (sample_mean - distribution_mean) / sqrt(sample_variance/sample_size) This test assumes that a sample is drawn from a normal distribution and it determines how probable it is to draw a sample of this mean and variance. It compares the sample mean and distribution mean and scales it by the sample variance. In these experiments the NULL hypothesis is: H(0): P(W1 W2 ... Wn) = P(W1)*P(W2)* ...* P(Wn) In other words we are assuming that the words are independent and that the occurance of N word collocations is merely by chance. Based on the t-score for a particular collocation we can reject the NULL hypothesis with a certain confidence interval or accept it as being true if we do not fall inside the required confidence interval. The distribution mean is taken to be the product of the probability of the occurance of the individual words of the collocation [p(W1)*p(W2)*...*P(Wn)] since we are assuming that they are independent (NULL hypothesis). I chose the t-test because of its simplicity in implementation. I wanted to compare its performance with other association measures that measure actual counts of events and not on estimated sample mean and variance. I wanted to know how well it will perform compared to the more complex measures of association. I believe that it will not perform as good as the true mutual information because it considers mainly the joint probability of the individual words of the collocation. On the other hand the true mutual information test also considers the cases where the individual words occur separate together with the other cases. Another reason that I chose this test is because it extends easily to trigrams. The NULL hypothesis becomes P(W1 W2 W3) = P(W1) * P(W2) * P(W3) and so the distribution mean will equal P(W1) * P(W2) * P(W3). As we can see this extension is straightforward to understand and to implement (see Experiment 2). I created two corpora of about 1,000,000 words by joining a number of books from Project Gutenberg as follows: CORPUS 1: # TOKENS 1083129 Don Quixote by Miquel de Cervantes Crime and Punishment, by Dostoevsky Robinson Crusoe, by Daniel Defoe The Adventures of Sherlock Holmes by Arthur Conan Doyle Hamlet by Shakespeare PG has multiple editions of William Shakespeare's Complete Works CORPUS 2: # TOKENS 1037882 ==> bndlt10.txt <== The Project Gutenberg Etext of A Bundle of Letters by Henry James ==> frnrd10.txt <== The Project Gutenberg etext of The Friendly Road; New Adventures in Contentment by David Grayson (pseudonym of Ray Stannard Baker) ==> ktysc10.txt <== The Project Gutenberg EBook of What Katy Did At School, by Susan Coolidge ==> lchch10.txt <== The Project Gutenberg EBook of Chess and Checkers: The Way to Mastership by Edward Lasker ==> lstfc10.txt <== The Project Gutenberg Etext of Lost Face, by Jack London ==> mrswf10.txt <== The Project Gutenberg Etext of The Mayor's Wife, by Anna Katherine Green ==> outpo10.txt <== The Project Gutenberg Etext of Outpost by J.G. Austin ==> rbcru10.txt <== The Project Gutenberg Etext of Robinson Crusoe, by Daniel Defoe ==> rdst10.txt <== The Project Gutenberg EBook of Rodney Stone, by Arthur Conan Doyle ==> stjlp10.txt <== The Project Gutenberg EBook of The Story Of Julia Page, by Kathleen Norris ==> strlc10.txt <== The Project Gutenberg Etext of The Story Of Electricity, by John Munro ==> trbny10.txt <== The Project Gutenberg EBook of The Rover Boys in New York, by Arthur M. Winfield ==> txsrn10.txt <== The Project Gutenberg EBook of A Texas Ranger, by William MacLeod Raine ==> wstys11.txt <== The Project Gutenberg Etext of Under Western Eyes, Joseph Conrad ********************************************** *****************Experiment 1***************** ********************************************** TOP 50 COMPARISON: ------------------ Using CORPUS2: Bigrams ------- count.pl CORPUS_2N.cnt CORPUS2 statistics.pl tmi CORPUS_2N.tmi CORPUS_2N.cnt Top 26 entries shown for each test tmi user2 (t-score) --- --------------- 1206150 1206150 ,<>and<>1 0.0232 14223 79083 31336 Todos<>los<>1 1098.2477 1 1 1 .<>The<>2 0.0109 3436 53260 4038 COINCIDENCES<>SUGGESTING<>1 1098.2477 1 1 1 don<>t<>3 0.0087 1256 1257 4248 Fops<>Alley<>1 1098.2477 1 1 1 of<>the<>4 0.0084 6644 26941 52108 laisser<>aller<>1 1098.2477 1 1 1 .<>He<>5 0.0082 2503 53260 2834 Scots<>Magazine<>1 1098.2477 1 1 1 .<>It<>6 0.0075 2331 53260 2686 superseded<>manual<>1 1098.2477 1 1 1 ,<>,<>7 0.0066 2 79083 79083 los<>Santos<>1 1098.2477 1 1 1 in<>the<>8 0.0065 4523 15673 52108 _ce<>miserable_<>1 1098.2477 1 1 1 .<>But<>9 0.0059 1801 53260 2055 ga<>irls<>1 1098.2477 1 1 1 .<>I<>10 0.0058 5149 53260 22711 Ari<>Frode<>1 1098.2477 1 1 1 ,<>but<>11 0.0052 2690 79083 4665 CAVE<>RETREAT<>1 1098.2477 1 1 1 .<>She<>12 0.0050 1560 53260 1806 strutting<>cockerel<>1 1098.2477 1 1 1 ,<>.<>13 0.0044 1 79083 53260 Sallie<>Laundon<>1 1098.2477 1 1 1 the<>,<>14 0.0043 2 52108 79083 DARTAWAY<>Quit<>1 1098.2477 1 1 1 had<>been<>15 0.0042 1013 7956 2446 quel<>nom<>1 1098.2477 1 1 1 Mrs<>.<>16 0.0041 1103 1105 53260 personne<>indiquee_<>1 1098.2477 1 1 1 ;<>and<>17 0.0039 2020 7387 31336 GIMLET<>BUTTE<>1 1098.2477 1 1 1 .<>And<>17 0.0039 1341 53260 1735 INTERFERENCE<>Struck<>1 1098.2477 1 1 1 .<>You<>18 0.0038 1298 53260 1704 Highland<>Fling<>1 1098.2477 1 1 1 Mr<>.<>18 0.0038 1020 1029 53260 Humphry<>Gunther<>1 1098.2477 1 1 1 did<>not<>18 0.0038 790 1643 5777 WHICH<>CHRISTIAN<>1 1098.2477 1 1 1 .<>,<>19 0.0037 133 53260 79083 Andrew<>Gamble<>1 1098.2477 1 1 1 ;<>but<>20 0.0034 1026 7387 4665 TAMES<>GOATS<>1 1098.2477 1 1 1 to<>be<>20 0.0034 1532 27462 5008 Ferned<>grot<>1 1098.2477 1 1 1 I<>am<>20 0.0034 823 22711 968 MOST<>SUDDEN<>1 1098.2477 1 1 1 Project<>Gutenberg<>21 0.0033 322 408 322 azure<>firmament<>1 1098.2477 1 1 1 Considering the top 50 entries in each test, it is clear that the t-score gave better results as compared to the tmi. Most of the top entries of the tmi test include puntuation marks and stop words which are not of much interest. One interesting collocation identified by the tmi measure is "Project<>Gutenberg<>". On the other hand the t-score test gives more meaningful words in the collocation and identifies some good collocations such as "Intellectual<>Peach<>", which most probably has some meaning as a unit of words and does not actually mean an 'intellectual peach'. another good collocation identified by the t-score test is "Padre<>Johannes<>" which is of the proper nouns form. Another interesting collocation is "Rural<>Taste<>". For most of the entries in the t-score table we can safely ignore the NULL hypothesis with a confidence interval of 99.5%. On the other hand there is no comparable measure of confidence interval for the true mutual information measure. From the current experiments, one might be tempted to conclude that the t-score measures are significantly better at identifying interesting collocations as compared to the true mutual information measures. However, here I would like to point out the fact that it might be the case that the t-score measure might be over-classifying entries. In other words it thinks that most bigrams are collocations while they are not. There might be a lot of false positives in the data and from the above sample output it can be seen that some of the top bigrams belong to this classification. This also can be true for the true mutual information measures as well. Here I would like to suggest the use of some sort of ratio that will calculate the accuracy of the measure. In other words it will tell us how many of the bigrams are false positives and which ones are actually true collocations. However, if we disregard the idea of false positives, I can conclude from the above data that the t-score performed better than the true mutual information test in this experiment. One reason for this difference in results can be attributed to the expected frequency of the individual tokens for a bigram. For example, punctuation marks might have appeared a lot of times in the text and this will affect the calculation of the tmi test since it is based on the expected and observed frequencies for each cell. However, the t-score test does not consider all the cells in the contingency table, as discussed above. CUTOFF POINT: ------------- User2.pm: For this measure, it seems that a cutoff exists approximately around the 150 (+-20) rank item. It can be observed that more and more bigrams do not seem to be good collocations. After about the 120th rank it seems that at least a third of the bigrams are not collocations. Since this test has as the numerator the difference between the sample mean and the distribution mean and as we go deeper down in the list the rank decreases, this means that the difference is decreasing at a higher rate compared to the numerator. This indicates that the sample mean and the distribution mean are merging to a common value. Since our NULL hypothesis suggests that the tokens of the sequence are independent of each other, and the mean of the distribution is based on this and the sample mean is converging towards the distribution mean then the sequence in our sample will become more and more independent as we move down the list. tmi.pm: For this measure, there is a cutoff around the 50th rank item. It seems that most of the bigrams in the top positions are not bigrams and consist of punctuation marks and stop words. However, after about the 50th rank we start to see less quotation marks and more word bigrams. We also start to observe some interesting collocations such as "Web<>sites<>" which occurs at rank 50. On the other hand the user2.pm measure ranks this collocation at rank 120 which presents a big difference between the two rankings. This difference is due to the different counts that each measure considers. One frequency count that the true mutual information considers but the t-score does not is the count of when both words did not occured together, and I suggest that this contributes to this difference. RANK COMPARISON --------------- Comparing ll with tmi and user2: -------------------------------- ***************************************************** csdev22(52):>./rank-script.sh ll tmi CORPUS_2N.cnt Rank correlation coefficient = -0.8878 csdev22(53):>./rank-script.sh ll user2 CORPUS_2N.cnt Rank correlation coefficient = 0.8745 ***************************************************** Comparing mi with tmi and user2: -------------------------------- ***************************************************** csdev18(15):>./rank-script.sh mi tmi CORPUS_2N.cnt Rank correlation coefficient = -0.8889 csdev18(16):>./rank-script.sh mi user2 CORPUS_2N.cnt Rank correlation coefficient = 0.9516 ***************************************************** From the above results it can be observed that user2.pm (t-test) is very closely related to both the mutual information and the log likelihood ratio. To add to this, it is near perfect match to mutual information. On the other hand the true mutual information test is a reversed ranking as compared to the mutual informatio and the log likelihood test to approximately the same degree. One reason to account for this is that the true mutual information test accounts for all the cells in the contingency table, but the mutual information is only concerned with one of the cells, so the influence of the frequency counts differ. This is also the case with user2.pm which only considers certain cells and not all of them. OVERALL RECOMMENDATION ---------------------- Based on the experiments carried out, I would recommend the user2, mi and ll measures of association since they are very similar. I came up with this conclusion from the results of the user2.pm. From the rank.pl program it was seen that it is very similar to ll and mi and so they should produce approximately similar results in term of rankings. The output of tmi was not very good in term of the top entries which were not very meaningful and seemed to ordinary bigrams with no association. Also, I noticed that the user2.pm (t-test) converges to the z-test when the degrees of freedon are large as is in these experiments. This can also be seen from the t-distribution confidence interval values. ********************************************** *****************Experiment 2***************** ********************************************** This experiment involves collocations composed of 3 word sequences (trigrams). The first test to implement is the log-likelihood test modified to consider trigrams. The second test is the t-test measurement (user3.pm) which will be extended to acomodate for three variables instead of two. As a result it will be able to handle 2x2x2 contingency tables. The following experiments were carried out: TOP 50 RANK ----------- Example run of the program: --------------------------- count.pl --ngram 3 CORPUS_3N.cnt CORPUS_3N statistics.pl --ngram 3 user3 CORPUS_3N.user3 CORPUS_3N.cnt top 26 entries shown for each test user3 ----- 1206149 _ma<>chere<>amie_<>1 1206149.0000 1 1 1 1 1 1 1 YZ<>MN<>OP<>1 1206149.0000 1 1 1 1 1 1 1 DEN<>WILD<>ZEE<>1 1206149.0000 1 1 1 1 1 1 1 saute<>aux<>champignons<>1 1206149.0000 1 1 1 1 1 1 1 ai<>trop<>dit<>1 1206149.0000 1 1 1 1 1 1 1 Cuesta<>del<>Burro<>1 1206149.0000 1 1 1 1 1 1 1 se<>laisser<>aller<>1 1206149.0000 1 1 1 1 1 1 1 vais<>seux<>cella<>1 1206149.0000 1 1 1 1 1 1 1 EXTRA<>PRIVATE<>LESSONS<>1 1206149.0000 1 1 1 1 1 1 1 WX<>GH<>IJ<>1 1206149.0000 1 1 1 1 1 1 1 WHICH<>CHRISTIAN<>MEETS<>1 1206149.0000 1 1 1 1 1 1 1 SPIRITUAL<>INTERFERENCE<>Struck<>1 1206149.0000 1 1 1 1 1 1 1 cauld<>kail<>hae<>1 1206149.0000 1 1 1 1 1 1 1 MN<>OP<>QR<>1 1206149.0000 1 1 1 1 1 1 1 COINCIDENCES<>SUGGESTING<>SPIRITUAL<>1 1206149.0000 1 1 1 1 1 1 1 Todos<>los<>Santos<>1 1206149.0000 1 1 1 1 1 1 1 AB<>CD<>EF<>1 1206149.0000 1 1 1 1 1 1 1 SUGGESTING<>SPIRITUAL<>INTERFERENCE<>1 1206149.0000 1 1 1 1 1 1 1 DAN<>BAXTER<>GIVES<>1 1206149.0000 1 1 1 1 1 1 1 filet<>saute<>aux<>1 1206149.0000 1 1 1 1 1 1 1 YE<>GENTLE<>READER<>1 1206149.0000 1 1 1 1 1 1 1 STEVE<>OFFERS<>CONGRATULATIONS<>1 1206149.0000 1 1 1 1 1 1 1 CD<>EF<>ST<>1 1206149.0000 1 1 1 1 1 1 1 _la<>personne<>indiquee_<>1 1206149.0000 1 1 1 1 1 1 1 BAXTER<>GIVES<>AID<>1 1206149.0000 1 1 1 1 1 1 1 Intellectual<>Peach<>Parer<>1 1206149.0000 1 1 1 1 1 1 1 ll3 ---- 1206149 .<>,<>the<>1 7786623.9583 2 53260 79083 52108 133 1502 2230 .<>.<>,<>2 7751902.6629 89 53260 53260 79083 1603 4088 133 ,<>of<>,<>3 7715136.9923 2 79083 26941 79083 505 5551 49 ,<>I<>,<>4 7394871.8862 3 79083 22711 79083 3598 5551 196 ,<>and<>,<>5 7341499.8465 333 79083 31336 79083 14223 5551 651 ,<>in<>,<>6 7192382.4111 1 79083 15673 79083 1132 5551 235 ,<>was<>,<>7 7079014.9195 6 79083 12419 79083 500 5551 249 ,<>that<>,<>8 7026236.7496 45 79083 12135 79083 1212 5551 472 ,<>,<>one<>9 6962526.4875 1 79083 79083 3502 2 248 198 ,<>you<>,<>10 6904810.7321 5 79083 9087 79083 858 5551 605 ,<>had<>,<>11 6904386.0245 4 79083 7956 79083 277 5551 83 two<>,<>,<>12 6891919.9773 1 1611 79083 79083 56 147 2 ,<>he<>,<>13 6880522.0448 1 79083 8746 79083 1592 5551 153 ,<>for<>,<>14 6873635.0312 16 79083 7959 79083 990 5551 96 ,<>s<>,<>15 6859784.9031 1 79083 6644 79083 1 5551 105 ,<>as<>,<>16 6843698.9151 9 79083 8024 79083 1899 5551 30 square<>,<>,<>17 6837502.3425 1 174 79083 79083 23 11 2 ,<>,<>V<>18 6834679.2425 1 79083 79083 79 2 4 1 ,<>is<>,<>19 6830741.7429 1 79083 6389 79083 270 5551 222 ,<>my<>,<>20 6823826.3993 3 79083 6056 79083 330 5551 9 and<>,<>the<>21 6821418.3294 5 31336 79083 52108 651 1409 2230 ,<>be<>,<>22 6788346.9271 1 79083 5008 79083 43 5551 74 ,<>of<>.<>23 6773018.1244 1 79083 26941 53260 505 1716 47 ,<>she<>,<>24 6769209.9658 1 79083 5660 79083 1158 5551 86 ,<>me<>,<>25 6762351.7502 3 79083 5225 79083 25 5551 935 ,<>have<>,<>26 6762303.2148 2 79083 4391 79083 59 5551 45 Here we have also a similar case to the bigram situation. It seems that in this case the ll3 measure produces top results with meaningless collocations that are merely trigrams.On the other hand the user3 test produces top results that consisted or alphabetic characters but the majority of them , I believe, are not meaningful collocations but just happened to be next to each other. However, there were a some good exampls of collocations such as "Electromotive<>Force<>Resistance<>" to describe a certain force. Here, I would say that the user3 test performed better with regard to the top 50 rankings. CUTOFF POINT ------------ The cutoff point for these two measures differ tremendously. Even the rankings are vastly different. Consider the following trigram and its associated rank. In the case of the ll3 it is ranked at 713651 but in the case of user3 it is ranked as 1. This is a tremendous difference. Intellectual<>Peach<>Parer<>713651 - ll3 Intellectual<>Peach<>Parer<>1 - user3 For the ll3 we start to see trigrams that consist of words made up of alphanumeric characters halfway through the rankings. By contrast in the user3 ranking table we see this phenomena from the top of the table. I believe that this is because the likelihood measurement considers both the expected and observed frequencies but the t-test considers the sample mean and variance together with the distribution mean which equals the probability of obtaining an element at random with the same mean and variance as that of the NULL hypothesis (H0).In trigrams the marginal tables are very big because of the large number of combinations and so this will influence the result of the ll3. I could not find a clear cutoff point for the ll3 rankings. It seemed that there were a mixture of good collocations mixed everywhere in the table. There was not a clear cut breaking point. OVERALL RECOMMENDATION ---------------------- In this experiment I will also recommend the user3 measure (t-test) because it identifies collocations better as compared to the ll3 measure. REFERENCES: ---------- - Foundations of statistical Natural Language Processing - Manning and Schutze - The analysis of Cross-Classified Categorical Data - Fienberg - Measures of Association - Albert M. Liebetrau - Statistical Tachniques for Data Analysis - John Keenan Taylor - Multiway Contingency table analysis for the social sciences - Thomas D. Wickens ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Name : Nitin Agarwal Date 10-Oct-02 Class : Natural Language Processing CS8761 ========================================================= Objective To find a test of measure such that it holds good for bi-grams as well as trigrams and implement it on a corpora, and then finally compare this measure with some standard measures. Procedure First, N-gram Statistics package (NSP) version 0.51 needs to be downloaded and installed. Once, this has been done, a test of measure needs to be found that can be used with both 2 and 3 word sequences. Now, we need to implement this test of measure on a huge corpus to identify the 2 and 3 word collocations in the corpus. The implementation will consist of four .pm files namely tmi.pm, user2.pm, ll3.pm and user3.pm. Corpus The corpus used for this assignment is corpus.txt. Measure of association T-test has been used. The formula for T-test is as given below t-score = O11-E11/sqrt (O11) where O11 is observed value and E11 is expected value The measure of association was found on http://www.collocations.de/EK/am-html/ Conventions used In this experiment, all the output files have an extension .op2 and .op3 depending on whether it is for a bi-gram or a tri-gram respectively. Output file name is same as the name of the file being processed. Experiment 1 ------------ TOP 50 COMPARISON Fiftieth bi-gram in tmi.op2 is ranked at 29, whereas in user2.op2 this row is occupied by a rank 49 bi-gram. Hence, we observe that in the latter file, except one all other bi-grams up to the rank of 50 have unique scores which is not the case with tmi.op2. From the output files it can be seen that when compared to top 50 collocations in tmi.op2, user2.op2 has more collocations with higher frequency of occurring together. CUTOFF POINT In user2.op2, from the line 45312 the value of the score goes negative. All the tri-grams above this line have positive value. I could not find any cutoff point for ll.op2. The possible reason behind this is that the scores calculated using this measure are pretty high and hence do not reach a negative value. RANK COMPARISON tmi.op2 Vs mi.op2 ----------------- Rank correlation coefficient = 0.0874 As the value of the rank correlation coefficient is just above 0, the two files almost unrelated to each other and have very small amount of relatedness. One important point worth noticing is that in mi.op2, all of first 121 bi-grams have the same rank 1. However, conversely in tmi.op2 most of the bi-grams on the top of the list have unique scores. As we move down the list, some of the bi-grams seem to be sharing the same rank and the numbers of these bi-grams seem to be increasing at an almost steady rate. Looking at the 3 right-most columns, it can be noticed that they all have value 1 for first ranked bi-grams in mi.op2, whereas in tmi.op2 the value for the number of occurrences (third column of numbers) does not seem to be following any progression. This shows that mi.op2, lists all the collocations that occur once in the beginning. One interesting point, that I noticed which may not be very relevant is that, in tmi.op2 the score of the highest ranked bi-gram is 0.0387 which is close to the inverse of the highest ranked bi-gram of mi.op2 whose score is 17.0977. --output of tmi.pm ,<>and<>1 0.0387 2634 11467 4775 I<>had<>2 0.0174 810 5158 1546 ;<>and<>3 0.0149 844 2330 4775 of<>the<>4 0.0086 817 3558 5854 --output of tmi.pm SEND<>MONEY<>1 17.0977 1 1 1 isle<>Fernando<>1 17.0977 1 1 1 COUP<>DE<>1 17.0977 1 1 1 repenting<>prodigal<>1 17.0977 1 1 1 tmi.op2 Vs ll.op2 ----------------- Rank correlation coefficient = 0.1916 Even in this comparison the rank correlation coefficient is positive but a very low value. Hence, we infer that again the two output files have only a small degree of relatedness, although this time it is more than the previous comparison. This time again for ll.op2 the all the collocations on the top of the list occur only once. In tmi.op2 the highest score is 0.0387 and the lowest value is 0. Whereas, in the case of ll.op2, highest value is 7516, which again goes all the way down to 0. Hence, we observe that the gradient for ll.op2 is much steeper than the gradient for tmi.op2. --output of ll.pm ,<>and<>1 7516.6877 2634 11467 4775 I<>had<>2 3387.5897 810 5158 1546 ;<>and<>3 2893.9474 844 2330 4775 ;<>but<>4 1675.1604 349 2330 958 user2.op2 Vs mi.op2 ------------------- Rank correlation coefficient=0.5824 The rank of 0.5824 shows that the two files are neither completely unrelated nor they match perfectly. They are somewhere between the two. As obvious from the rank value, user2.op2 is closer to mi.op2 when compared with tmi.op2. One similarity between the two files is that towards the end, the scores attain negative values in both the files. Again in this comparison, in user2.op2 in the beginning all the scores are unique which is not the case with mi.op2. --output of user2.pm ,<>and<>1 43.7160 2634 11467 4775 I<>had<>2 26.4628 810 5158 1546 ;<>and<>3 26.3213 844 2330 4775 of<>the<>4 23.3878 817 3558 5854 user2.op2 Vs ll.op2 ------------------- Rank correlation coefficient=0.8393 As seen from the coefficient value for these 2 files, they are nearly a perfect match. Most of the rows in the two files have nearly the same collocations. The only major difference between the two files is the value of the scores for the individual bi-grams which is far apart in the beginning gets closer towards the middle and again starts to diverge at the end. Once more, user2.op2 is closer to ll.op2 which is clear from its rank correlation coefficient. OVERALL RECOMMENDATION I would suggest ll.pm as a good measure for identifying significant collocations. As I have discussed above user2.op2 and ll.op2 have unique ranks for most of the bi-grams. The total number of unique ranks for user2.op2 is 12767 whereas for ll.op2 it is 19482. Since ll.op2 has more unique ranks than user2.op2 it is a better measure of association. Experiment 2 ------------ TOP 50 RANK In ll3.op3, most of the top 50 elements have rank 1, whereas in user3.op3, most of the top 50 entries have different ranks. ll3.op3 lists all the tri-grams which occur once at the bottom, whereas user3.op3 lists a trigram that occurs 6 times first and then this number starts decreasing before it starts increasing again towards the end of the file. --output of user3.pm by<>Daniel<>Defoe<>1 2.3877 6 590 6 6 6 6 6 PROJECT<>GUTENBERG<>tm<>2 2.2353 5 7 7 5 7 5 5 Small<>Print<>!<>3 2.2287 5 5 5 92 5 5 5 this<>Small<>Print<>4 2.1764 5 749 5 5 5 5 5 Illinois<>Benedictine<>College<>5 1.9998 4 4 4 4 4 4 4 --output of ll3.pm data<>,<>transcription<>1 392089.1000 1 1 11467 1 1 1 1 ,<>shame<>opposed<>1 392089.1000 1 11467 1 1 1 1 1 los<>Santos<>,<>1 392089.1000 1 1 1 11467 1 1 1 computerized<>population<>,<>1 392089.1000 1 1 1 11467 1 1 1 temperance<>,<>moderation<>1 392089.1000 1 1 11467 1 1 1 1 CUTOFF POINT In user3.op3, from the line 8634 the value of the score goes negative. All the tri-grams above this line have positive value. In the case of ll3.op3, there doesn't seem to be any cutoff value. OVERALL RECOMMENDATION For tri-grams, I would suggest user3.op3 for the same reasons that were given for bi-grams. ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Kailash Aurangabadkar CS8761 Assignment #2 -------------------------------------------------------------------------------- The objective of the assignment is to investigate various measures of association that can be used to identify collocations in large corpora (CORPUS) of text. The assignment is essentially to find out interesting co-occurrences of two or more words i.e. their co-occurrence is not coincidental. Such words group to form a collocation. -------------------------------------------------------------------------------- Collocations: A collocation is a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. Examples of Collocations: Collocations include noun phrases like "strong tea" and "weapons of mass destruction", phrasal verbs like "to make up", and other stock phrases like the "rich and powerful", "a stiff breeze" but not a "stiff wind", "broad daylight" etc. Criteria for Collocations Typical criteria for collocations: non-compositionality, non- substitutability, non-modifiability. Collocations cannot be translated into other languages word by word.A phrase can be a collocation even if it is not consecutive (as in the example knock . . . door). Compositionality A phrase is compositional if the meaning can predicted from the meaning of the parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination. E.g. strong tea. Idioms are the most extreme examples of non-compositionality. E.g. to hear it through the grapevine. Non-Substitutability We cannot substitute near-synonyms for the components of a collocation. Non-modifiability Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. ---------------------------------------------------------------------------------- Process:- The assignment is an addition to the NSP package currently available and developed by Satanjeev Banerjee. The implementation consists of four .pm files that will be used by statistic.pl program that comes with NSP. These .pm files will only run when used with NSP. ---------------------------------------------------------------------------------- The assignment consists of two experiments:- Experiment 1: To implement (true) mutual information and a test of association, that is not implemented in NSP, for bigrams. The association tests implemented in NSP are: Left Fisher, Chi-Squared, Dice, Point wise Mutual Information, and Log Likelihood. I have selected to implement u-test for collocations. The implementation of u-test is done in user2.pm. I have also implemented the student t-test for association. The implementation for t-test is done in tscore.pm. After implementing the three tests I executed these on bigrams collected from CORPUS, which was obtained by concatenating all the text files in BROWN1. The null hypothesis for the u-test and the t-test is that the two words are independent. Top 50 comparisons: I have executed the user2.pm, tmi.pm and tscore.pm on a corpus of text which consists of 1336151 bigrams which is the concatenation of all text files in /home/cs/tpederse/CS8761/BROWN1. This text is available at /home/cs/aura0011/CS8761/nsp/CORPUS . User 2: We find the following significant collocations in the top 50 ranks when executed on CORPUS: Breezy clotheslines, Peace Corps, Chemical Name, Southern California, final solution, policy trends, Agatha Christie, Hong Kong, Rhode Island, export import, testimony whereof, Electronic Circuits, Polypropylene glycol, barbarian hordes, output control, polio vaccine, freight car, armed force etc. It is observed that the u-test gives significant collocations, but also gives a lot of insignificant ones with them. The u test is dependent on the independent and joint frequency of the two words. If the two words tend to occur together always and each word in that collocation occurs only rarely in any other construct then u-test recognizes them as good candidates of collocation. But collocation made up of two more frequent words like bitten and down are ranked low as the two words making up the collocation also occur without each other quite often. These are the some of the entries in the output file created by user2.pm 1336151 ..... Rosy<>Fingered<>1 1155.9191 1 1 1 dabhumaksanigalu<>ahai<>1 1155.9191 1 1 1 Alcohol<>ingestion<>1 1155.9191 1 1 1 URETHANE<>FOAMS<>1 1155.9191 1 1 1 breezy<>clotheslines<>1 1155.9191 1 1 1 ..... MICROMETEORITE<>FLUX<>26 943.8026 2 2 3 Unifil<>loom<>27 943.8005 4 4 6 PEACE<>CORPS<>27 943.8005 4 6 4 JUNE<>18_<>27 943.8005 4 6 4 ...... DEAE<>cellulose<>30 909.4587 13 13 21 Ham<>Richert<>31 895.3684 3 3 5 Final<>Solution<>31 895.3684 3 5 3 Sheraton<>Biltmore<>31 895.3684 3 5 3 Pro<>Arte<>31 895.3684 3 3 5 T score: There are no any significant collocations in T score as it is based on frequency of occurrences. Some enteries are:- 1336151 of<>the<>1 78.1686 9181 36043 62690 .<>The<>2 66.7814 4961 49673 6921 in<>the<>3 60.4230 5330 19581 62690 ,<>and<>4 57.7736 5530 58974 27952 .<>He<>5 46.9438 2421 49673 2991 on<>the<>6 40.4487 2196 6405 62690 .<>It<>7 39.4899 1718 49673 2184 to<>be<>8 37.3632 1631 25725 6340 ,<>but<>9 37.3273 1648 58974 3006 to<>the<>10 36.0291 3266 25725 62690 .<>In<>11 34.5555 1320 49673 1736 .<>I<>12 34.0372 1565 49673 5877 at<>the<>13 31.9296 1448 4966 62690 .<>But<>14 31.8652 1115 49673 1371 for<>the<>15 30.4946 1655 8833 62690 from<>the<>16 29.6802 1243 4190 62690 Mutual Information:- We find the following significant Collocations in the top 50 collocations: United States, Rhode Islands, Peace corps, Bang Jensen, Air Force, Fiscal year, nineteenth century, President Kennedy, United Nations, Civil War, New Orleans, The Editor, Kansas city, Nuclear Weapons, High School, Vice President, York city, Export Import, Supreme court, St. Louis etc. We find that the mutual information is good in finding out collocations which are made of proper nouns. Some enteries in the output are as follows:- 1336151 .... on<>the<>10 0.0031 2196 6405 62690 United<>States<>11 0.0029 346 456 447 had<>been<>12 0.0028 721 5096 2470 .... years<>ago<>32 0.0008 126 948 246 Rhode<>Island<>32 0.0008 77 89 129 .... ominant<>stress<>38 0.0002 28 62 108 There<>were<>38 0.0002 74 901 3279 Export<>Import<>38 0.0002 14 15 15 be<>taken<>38 0.0002 52 6340 278 ..... he<>asked<>38 0.0002 57 6799 392 nuclear<>weapons<>38 0.0002 20 110 61 But<>it<>38 0.0002 90 1371 6873 Cutoff Point: User 2: By scanning through the output of the user2.pm module I find the cutoff point approximately when u value is around 12. Below 12 u-value the collocations given are the bigrams made up of common words. T-Score: T-score gives a scattered distribution of collocations, with more frequently occurring bigrams at top and neglecting the lower frequency, but important, ones. Thus a cut off value cannot be suggested for T score. Mutual Information: As mutual information depends on observed and expected frequencies a cut off is difficult to suggest. Rank Comparison: "True" Mutual Information:- When rank.pl is executed with output of "True" mutual information and output of Pointwise Mutual Information the value returned is -0.96, which suggests that mutual information and pointwise mutual information are not the same measures, but in fact are opposite as the corpus size goes on increasing. When I tried them on a smaller corpus of 700000 bigrams then rank.pl returned a value of -0.7. When rank.pl is executed with output of "True" mutual information and output of Log likelihood ratio measure the value returned is -0.96, which suggests that mutual information and Log likelihood are also not similar. In fact these two also go on differing as the Corpus size increases. Mutual Information tends to remove the discrepancies of likelihood ratio measures and pointwise mutual information. So it is a measure not like any of them. User 2: When rank.pl is executed with the values returned by u-test and pointwise mutual information then we find that the value turns out to be 0.9641 while those of u-test with those of Log likelihood ratio is 0.93. The measures pointwise mutual information, Log likelihood ratio and u-test are based on the calculation on the observed and expected values of the bigram. While the "True" mutual information is actually is a calculation on all the values in the contingencies table and try to analyze the Corpus on a whole. Thus it is different than the other three. Overall Recommendation: User2 :- This test is also based on the observed and expected values. But, it is more dependent on the difference between joint frequency and independent frequency. Thus it ranks less frequently occurring bigrams higher than those occurring more frequently. The observations from investigations also suggest that the highly ranked bigrams are those whose joint frequency and independently frequency does not differ. This test gives a relatively good measure of collocations as can be seen from the output. T-Score:- In cases, where n and c are not very frequent (and most words are infrequent), and where the corpus is large, then f(n)f(c)/N, will be a very small number . In such cases, subtracting this number from f(n,c) will make only a small difference. It follows that T = approx f(n,c) / sqrt(f(n,c)) = sqrt(f(n,c)) Therefore the main factor in the value of T is simply the absolute frequency of joint occurrences. The T-value picks out cases where there are many joint occurrences, and therefore provides confidence that the association between n and c is genuine. But, clearly, whether we rank order things by raw frequency or by its square root makes no difference. However, T is sensitive to an increase in the product f(n)f(c). In such cases T = [f(n,c) - X] / sqrt(f(n,c)), where X is significantly large relative to f(n,c). Since T decreases if f(n)f (c) becomes very large, the formula has a built-in correction for cases involving very common words. In practice, this correction has a large effect only with a small number of common grammatical words, especially if they are in combination with a second relatively common word. If the corpus gets larger, but f(n)f(c) stays the same, then f(n)f(c)/N decreases again, and T correspondingly increases. Thus T is larger when we have looked at a larger corpus, and can be correspondingly more confident of our results. Again, this effect is noticeable only in cases where node and/or collocate are frequent. But the cases of frequent collocates also form the drawback of the test. This test does not give importance to infrequent collocations. Mutual Information: Mutual Information score expresses the extent to which observed frequency of co-occurrence differs from expected (where expected means "expected under the null hypothesis"). But, there is problem with Mutual Information suppose a word appears just once with another word (which is not an unreasonable event) in the corpus. And the first word may have a very low overall frequency. Say the first word occurs only once i.e. in a bigram with the second word. Now we carry out the sums to calculate the expected value of the bigram and if corpus size is large, we get a very small expected value (about 1/1000). Now we see that even if the first word occurs just once with the second word the observed frequency will be 1000 times more than the expected joint frequency, and the mutual information value will be high. Also when I experimented by executing the count module by setting a window size of 3 for bigrams, I found some interesting bigrams like "rescue prisoners". These were not present in the output with default window size as there are some two-word collocations that have some other constant number of words between them. Like the one mentioned (rescue prisoners). It is not present in the output in the first case as a phrase like "rescue the prisoners" might have occurred. This experiment was executed on a corpus of smaller size as it the number of bigrams increases exponentially with increase in window size. -------------------------------------------------------------------------------- Experiment 2: To implement a module named ll3.pm that performs log likelihood ratio for 3 word sequences. To create the 3 word version of user2.pm and name it user3.pm. A 3 word version of the u-test is implemented in user3.pm. For both of these the null hypothesis is extended from that for two words to three words i.e. all the three words are independent. Top 50 Rank: User3: We find the following interesting collocations in the top 50 ranked trigrams: Getting along with, Rural Road Authority, Plant feeding facilities, Sound Tax policy, United Nations Day, long term approach, Deal with principles, state automobile practices, on unpaid taxes, peace corps volunteers, middle Atlantic states etc. It gives good candidates to be considered as collocations, but is not accurate and tend to give the less frequent word combinations in higher position than more frequent ones. Some of the entries are.: 1336150 _FRANKFURTER<>TWISTS_<>Blend<>1 1336150.0000 1 1 1 1 1 1 1 DEFINE<>INPUT<>OUTPUT<>1 1336150.0000 1 1 1 1 1 1 1 .... BED<>PRESSED<>OR<>38 243946.4984 1 2 1 15 1 1 1 PUSH<>UPS<>BUT<>38 243946.4984 1 2 3 5 2 1 1 PLANT<>FEEDING<>FACILITIES<>38 243946.4984 1 3 2 5 1 1 1 OR<>SODIUM<>HEXAMETAPHOSPHATE<>38 243946.4984 1 15 2 1 1 1 1 .... Dried<>rumen<>bacteria<>41 236200.1814 1 2 2 8 1 1 1 broiled<>steaks<>tantalizing<>41 236200.1814 1 2 4 4 1 1 1 GETTING<>ALONG<>WITH<>41 236200.1814 1 2 1 16 1 1 1 ..... INORS<>MUST<>ALSO<>41 236200.1814 1 2 8 2 1 1 1 0895<>NATURAL<>GAS<>41 236200.1814 1 4 4 2 1 1 1 SOUND<>TAX<>POLICY<>41 236200.1814 1 1 8 4 1 1 1 Accepted<>crystallographic<>symbolism<>41 236200.1814 1 2 2 8 1 1 1 Log likelihood 3: There are not so many interesting trigrams in top 50 rankings of the ll3.pm output. Following are the first 15 entries in the output 1336150 ,<>.<>The<>1 27511.2605 7 58974 49673 6921 57 30 4961 of<>.<>The<>2 25527.1297 2 36043 49673 6921 25 2 4961 a<>.<>The<>3 24045.6938 1 21998 49673 6921 15 2 4961 to<>.<>The<>4 24041.1144 3 25725 49673 6921 47 5 4961 .<>The<>.<>5 24006.2920 1 49673 6921 49674 4961 335 1 in<>.<>The<>6 23023.9802 11 19581 49673 6921 81 13 4961 is<>.<>The<>7 22535.4368 2 9969 49673 6921 29 3 4961 he<>.<>The<>8 22493.5350 1 6799 49673 6921 3 3 4961 for<>.<>The<>9 22438.9843 4 8833 49673 6921 28 4 4961 was<>.<>The<>10 22413.8054 1 9770 49673 6921 37 5 4961 with<>.<>The<>11 22337.1213 2 7007 49673 6921 19 2 4961 his<>.<>The<>12 22239.1767 3 6469 49673 6921 21 5 4961 at<>.<>The<>13 22215.6165 1 4966 49673 6921 10 1 4961 from<>.<>The<>14 22191.3867 1 4190 49673 6921 5 1 4961 had<>.<>The<>15 22155.1802 2 5096 49673 6921 17 2 4961 Cutoff point: It is difficult to spot collocations for trigrams as there are more possibilities of word combinations. So it is not quite easy to find out the cutoff point. User3: As we skim through the output of user3.pm we find that the cutoff point occurs somewhere in the range where the value of u - test is greater than 2000 for the current CORPUS. The entries having u-value more than 2000 are composed of less frequent words. The entries below 2000 value have more frequent words like THE, IS, or a punctuation as a word in them and hence are weak candidates of collocations. Log likelihood ratio 3: It gives a rather scattered distribution of the collocations and hence a cutoff cannot be suggested. Rank Comparison: When rank.pl is executed on outputs of Log likelihood ratio and output of User3 we get a value of 0.4726. This suggest that both the tests are not related to each other. Overall Recommendations: The u-test for collocations gives a good result for finding collocations that do not occur frequently. To find out the infrequent word combination collocations we have to carefully go through the output created. By infrequent word combination collocation we mean the collocations formed by words which occur independently more often than they occur in the collocation. The Log likelihood ratio measure gives better results for trigrams that occur more often than those which occur less often. But the trigrams which occur most are made up of common articles (like The) or verbs (like Is) or punctuation marks. Thus the high ranked trigrams in the output of ll3.pm are not significant collocations. ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ CS8761 Natural Language Processing Assignment 2 Archana Bellamkonda October 11, 2002. Problem Description : -------------------- The objective is to investigate various measures of association that can be used to identify collocations in large corpora of text. Following the investigation, a measure is selected and implemented so that it can be used with 2 and 3 word sequences to find the association between the words. This measure is also compared with some other standard measures and subsequent analysis is done. [ Measure of Association is given by a value calculated from the method selected and that value determines the chance of the words in a bigram or trigaram ( or n-gram in general) to be together, i.e. we get information about mutual dependence between the words. ] Input and Output Description : ---------------------------- Our aim is to calculate a score for each bigram or trigram that determines the associativity between the words. In this experiment, we write modules for calculating value based on true mutual information (tmi.pm), log-likelihood ratio for trigrams (ll3.pm), and a new measure for bigrams (user2.pm) and trigrams (user3.pm). Program, statistic.pl in NSP package calculates these values for each bigram or trigram, where we have to specify name of the module that we are using, a file to write the results into and an input file that is in the format output by count.pl in NSP package. [count.pl finds all n-grams in a file and writes them into the file specified, along with various frequency values associated with each n-gram]. Output of count.pl is used as input for statistic.pl. Example: ------ If abc.txt is the text on which we want to anaylse, then we can run a simple count.pl with the following syntax: perl count.pl result.txt input.txt Above syntax is for bigrams. If we need the same for trigrams, NSP package allows us to write, perl count.pl --ngram 3 result3.txt input.txt [Other ways to run are specified in the package. ] Once we get result.txt or result3.txt, containing all bigrams or trigrams along with some frequency values, we use that file as input to statistic.pl, that will calculate score for each bigram or trigram and store the results in an output file specified by the user. We should also specify the method that is to be used to calculate scores. To calculate scores using method specified in module user2.pm, we can write the simple command, perl statistic.pl user2 user2.res result.txt For trigrams, we can run statistic.pl as perl statistic.pl user3 --ngram 3 user3.res result3.txt Thus, we can compute scores that tell about association between words in an n-gram and get results in a file specified by user. Description of Method Used : -------------------------- user2.pm : -------- In user2.pm, I used Odds Ratio as a measure of associativity. [ Found in book, Categorical Data Analysis by ALAN AGRESTI, which I found by searching in library ] If X<>Y is a bigram, then Odds Ratio for it is given by the formula, Odds Ratio = ( frequency of X and Y occuring together * frequency of both X and Y not occuring together) / ( frequency of X occuring and Y not occuring * frequency of Y occuring and X not occuring) Note : Above, when we mention about occurance of X or Y, we are mentioning occurance of X in the first position and occurance of Y in second position for a bigram. As we have frequency of X and Y occuring together in numerator, and frequency of both X and Y not occuring together (which means that if this value increases, there is less chance that there will be bigrams that dont contain X in first position and Y in second position) in numerator, if these values increases, intuitively, there is stronger association between words and hence Odds Ratio will increase. Similarly, if frequency of either X or Y occuring in the correct position increases, it means that frequency of X and Y not occuring together increases and hence association is low. As we have these frequencies in denominator, Odds Ratio decreases, and hence it makes sense. Thus, Odds Ratio is calculated for each bigram and we can know the association between two words in a bigram depending on Odds Ratio. Greater the ratio, greater is the association between them. Example :- Say, i<>j is a bigram, and we have, (first bigram in output given by count.pl when run on Brown.txt) nij = number of times i<>j occured in the sample = 9181 ni+ = number of times i<>+ occured in the sample (second word can be anything) = 36043 n+j = number of times +<>j occured in the sample (first word can be anything) = 62690 total = total number of bigrams in the sample = 1336151 now we will get, nibarj = number of times i occured in first place where j is not there in second place = n+j - nij = 62690 - 9181 = 53509 nijbar = number of times j occured in second place where i is not there in first place = ni+ - nij = 36043 - 9181 = 26862 nibarjbar = number of times when neither of i and j occured = total - ni+ - n+j + nij = 1336151 - 36043 - 62690 + 9181 = 1246599 Odds Ratio = nij * nibarjbar / (nijbar * nibarj) = 9181 * 1246599 / ( 26862 * 53509) = 7.9625 user3.pm : ------- The same concept of Odds Ratio can be extended to apply for trigrams. Let X<>Y<>Z be a trigram. Mantel and Haenszel proposed the formula for calculating Odds Ratio for trigrams to be Odds Ratio = (for all k, (n11k * n22k / n++k) ) / (for all k, (n12k * n21k / n++k) ) [Page 236 in the book, Categorical Data Analysis] where n111 = frequency of occurance of trigram XYZ, n112 = frequency of occuracne of trigram where X,Y are in first and second positions and Z is not in third position, n121 = frequency of occurance of trigram where X,Z are in first and third positions and Y is not in second position, n122 = frequency of occurance of trigram where X occurns in first place and Y,Z are not in second and third positions, . . . . and so on n++1 = n111 + n121 + n211 + n221 and n++2 = n112 + n122 + n212 + n222 and k = 1,2 for trigram. Odds Ratio makes sense for trigrams also as in bigrams as increase in value of numerator implies greater associativity intuitively. Example :- ------ total = 1071112 n111 = xyz = 2179 xpp = 35891 pyp = 35891 ppz = 35891 xyp = 3751 xpz = 2760 pyz = 3751 n121 = xybarz = xpz - xyz = 2760 - 2179 = 581 n211 = xbaryz = pyz - xyz = 3751 - 2179 = 1572 n112 = xyzbar = xyp - xyz = 3751 - 2179 = 1572 n221 = xbarybarz = ppz - n121 - n211 - xyz = 35891 - 581 - 1572 - 2179 = 31559 n122 = xybarzbar = xpp - n112 - n121 - xyz = 35891 - 1572 - 581 - 2179 = 31559 n212 = xbaryzbar = pyp - n112 - n211 - xyz = 35891 - 1572 - 1572 - 2179 = 30568 n222 = total - xpp - pyp - ppz + xyp + xpz + pyz - xyz = 1071112 - 35891 - 35891 - 35891 + 3751 + 2760 + 3751 - 2179 = 97152 npp1 = n111 + n121 + n211 + n221 = 2179 + 581 + 1572 + 31559 = 35284 npp2 = n112 + n122 + n212 + n222 = 1572 + 31559 + 30568 + 97152 = 160851 odds ratio = (npp2.n111.n221 + npp1.n112.n222) / (npp2.n121.n211 + npp1.n122.n212) = EXPERIMENT 1 : ------------ Once user2.pm is implemented, it is run with NSP to perform analysis on the following. TOP 50 COMPARISION : ------------------ --> Number of Ranks : --------------- On observing the ranks produced by user2.pm and tmi.pm, and considering the top 50 ranks, it is observed that there are only 40 ranks for the text I gave when we use tmi.pm. So, I couldnt even observe 50 ranks even though the text consists of more than a million tokens. But there are 131972 ranks when user2 module is used. Analysis : The reason for getting less number of ranks is the value of the score being zero for most of the bigrams. The score is zero implies that (observed frequency of 2 words being together) / (expected value for that words being together) is negative. Because only then the value is not calculated in the logic of the program. [log is calcualted only for values greater than 0.] There is no chance that expected value is negative. [Since it is product of left and right frequencies of a bigram divided by total number of bigrams and all these values are taken from count.pl and they are never negative.]. SO, the only possibility is that observed frequency for 2 words being together can be zero but never negative. So, this implies that if the value, (observed frequency of 2 words being together) / (expected value for that words being together) is valid, then only possibility to get zero would be if observed frequency is equal to expected frequency. [because log 1 = 0, and that is the only way we can get zero] NOTE: ------ Formula used for tmi is: score = for all X and Y ( (nij / n++) * log ( nij / eij) ) where log is to base 2. Conclusion for this Analysis : Using tmi, we are observing that observed value is equal to expected in relatively more number of cases which means that words in bigram are independent. Thus, Null Hypothesis can be accepted in this case. --> Comparision between user2 and tmi : --------------------------------- I observed the bigrams that are ranked high using tmi are ranked low using user2. If we are considering a particular text and performing analysis on it, I feelt that user2 would be more interesting than tmi as it gives wider rankings, so that we can have better intuition regarding words of that particular text. In my experiment, I observed that using user2, bigrams that are some sort of names, or common phrases (atleast to that text) are ranked high. Using tmi, I observed that bigrams that are more common in general language, for example, ".<>The<>" (ranked 1) and not to particular text are ranked high. Hence, I feel that user2 would be significantly better than tmi for analysis of texts of particular author, or particular category, etc but tmi would be better than user2 to perform analysis on a language rather than on a specific text. CUTOFF POINT : ----------- tmi.pm : For scores generated using tmi module, 0.0135 is the score for first rank in my experiment. And the score goes on decreasing with increase in rank and at a stage it remains constant once it reaches zero. We can neglect all the bigrams whose score is zero. This means that the words in the bigrams with score zero have no dependence or association and hence we can neglect them. Hence the Cut-Off point will be the point where we encounter zero as value for measure. user2.pm : For scores generated using user2 module, 1336150 is the score for first rank in my experiment. Thereon, score goes on decreasing as rank increases but I didnt observe any cut off point beyond which I can neglect the values. I didnt find a cut off as the scores are not normalized or within a certain boundary. When there are boundaries, we can find cutoff values at the boundaries. But here as we lack specific boundaries, we are unable to determine the cutof point. This tells us that this test not suitable for all texts. The reason is, as the size of corpora increases, the value of odds ratio goes on increasing and this is not very desirable. Escpecially when we have cases where a bigram occurs only once in a very large corpora, the value of (frequency of occurances without X and Y) in the numerator increases considerably and we dont have a way to normalize this. Thus, we can tell that Odds Ratio test is suitable only for texts of reasonable size with accpetable frequency values for the occurances of bigrams. Also, we can conclude that we didnt find a cutoff using user2.pm as it is a variation of size. When we have a variant of size, we can have a cut off generally by using a function of size, and cutoff changes with size. But here, it is not clear what the relation is between size of the text and Odds Ratio. Hence, we am unable to determine a cutoff for this experiment. RANK COMPARISION : ---------------- Comparision with ll.pm :- The rank corelation coefficient on comparing ll.pm with tmi.pm is -0.9668 The rank corelation coefficient on comparing ll.pm with user2.pm is 0.8349 Comparision with mi.pm :- The rank corelation coefficient on comparing mi.pm with tmi.pm is -0.9668 The rank corelation coefficient on comparing mi.pm with user2.pm is 1.0000 Observing the rank co-relation coefficients above, we can conclude that user2 gives exactly the same rankings as mi.pm, which means that there is high relatedness between mi.pm and user2.pm. Also, user2.pm has high relatedness to ll.pm. Thus we can tend to infer that user2 is a good measure of association as it has good co-relation with the existing measures. This is true as long as the drawback observed when determining cufoff point above is not significant. --> Related but Negative value :- -------------------------- One interesting observation made is that comparing with tmi.pm gives us negative coefficient. But if we observe actual data in the results of both modules, they have very similar rankings. For example, consider first 10 rankings given by ll.pm and tmi.pm : ll.pm: ---- 1336151 .<>The<>1 25043.4196 4961 49673 6921 of<>the<>2 18856.1038 9181 36043 62690 .<>He<>3 13183.4486 2421 49673 2991 in<>the<>4 11393.6840 5330 19581 62690 .<>It<>5 9139.3156 1718 49673 2184 ,<>and<>6 9072.9347 5530 58974 27952 .<>In<>7 6844.0270 1320 49673 1736 ,<>but<>8 6309.5959 1648 58974 3006 the<>the<>9 6126.7892 3 62690 62690 .<>But<>10 6064.5057 1115 49673 1371 tmi.pm : ----- 1336151 .<>The<>1 0.0135 4961 49673 6921 of<>the<>2 0.0102 9181 36043 62690 .<>He<>3 0.0071 2421 49673 2991 in<>the<>4 0.0062 5330 19581 62690 .<>It<>5 0.0049 1718 49673 2184 ,<>and<>5 0.0049 5530 58974 27952 .<>In<>6 0.0037 1320 49673 1736 ,<>but<>7 0.0034 1648 58974 3006 .<>But<>8 0.0033 1115 49673 1371 the<>the<>8 0.0033 3 62690 62690 to<>be<>9 0.0032 1631 25725 6340 on<>the<>10 0.0031 2196 6405 62690 So, it is clear that both the measures must have high relatedness as the rankings are nearly the same. That means, we have to get a value around +1(expected). But we observed a value around -1 (observed value = -0.9668). This is an interesting observation that can be made. The reason for this is, tmi is normailized and so we could observe a cutoff at 0 and all the bigrams thereafter can be neglected as all of them have the same rank (only 40 ranks are observed in this experiment). But ll and mi dont show the same rank to so many bigrams and moreover the scores do depend on size of corpora and we have many rankings. (for example, 106744 rankings are observed in ll). Since difference in rankings is the criteria in calculating the corelation coefficient, these differences produce a negative value. Therefore, though these methods seem to have high corelatedness, they actually end up having high reverse corelatedness. OVERALL RECOMMENDATION : ---------------------- Based on all the observations made, I think the best method for identifying significant collocations in a large corpora among mi.pm, ll.pm, user2.pm and tmi.pm would be "mi.pm" or "ll.pm" user2.pm might not be appropriate in cases where most bigrams dont occur more than once, or where the size of corpora is very large, for the reasons explained earlier during observations. tmi.pm would also be inappropriate compared to mi.pm as it quickly converges to cutoff, which excludes bigrams with scores equal to cut off from consideration. EXPERIMENT 2 : ------------ Once user3.pm is implemented, the following are done. TOP 50 COMPARISION :- ------------------ On observing the top 50 bigrams after running NSP on a corpus using ll3.pm and user3.pm, I see that the two methods are significantly different from each other as the bigrams ranked below 50 using ll3.pm and bigrams ranked below 50 using user3.pm are significantly different. Also, it is observed that score values is relatively high using user3 compared to scores using ll3. This could tell that user3.pm is more affected by size of the text CUTOFF POINT :- ------------ No cut off points are observed for either of the methods, ll3.pm or user3.pm, for reasons similar to the bigram case. OVERALL RECOMMENDATION :- --------------------- Among ll3.pm and user3.pm, a better method would be ll3.pm as we get very high ratios using user3.pm as size of corpora increases and user3 is especially degraded when many trigrams occur only once which causes the ratio to increase considerably, as with the case in user2.pm for bigrams. ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ #------------------------[experiments.txt]---------------------------- # Assignment # 2 : CS-8761 Corpus Based NLP # Title : Report for "The Great Collocation Hunt" # Author : Deodatta Bhoite (bhoi0001@d.umn.edu) # Version : 1.0 # Date : 10-10-2002 #--------------------------------------------------------------------- Introduction ------------ "You shall know a word by the company it keeps!" - J.R.Firth The most trivial way of finding out collocations in a text would be counting the number of joint occurrences of two words. However, it is seen that we can find more significant collocations by applying various tests of association to the frequency data (contingency table) of the corpus. In this assignment we investigate various tests of association to identify 2-word and 3-word collocations. Experiment 1 ------------ Corpus for this experiment has been taken from Project Gutenberg and includes the following files: Alice in Wonderland (alice.txt) Black Beauty (blackbeauty.txt) Robinson Crusoe (crusoe1.txt) Further Adventures of Robinson Crusoe (crusoe2.txt) Dracula (dracula.txt) King James New and Old Testament (bible.txt) Vikram and the Vampire (vikram.txt) The total token size of the corpus is around 1.37M tokens. However, I remove the digits and the punctuation from the data by specifying a token file to the count.pl. The digits and punctuation are filtered by the following regular expression. --[token.txt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /[a-zA-Z]+/ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We then count find the bigrams and frequency data related to them by the following command: % count.pl --token token.txt test2.cnt /home/cs/bhoi0001/CS8761/nsp/ The top 50 collocations based on frequency counts are as follows. We observe that they are not significant, which is probably because they do not take into consideration the independent occurrence of the words forming the bigrams. --[test2.cnt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 of<>the<>14731 48646 89427 in<>the<>6985 20407 89427 the<>LORD<>5964 89427 6653 and<>the<>5282 59758 89427 to<>the<>3830 30166 89427 all<>the<>2661 8710 89427 shall<>be<>2551 10449 10176 And<>the<>2249 13476 89427 I<>will<>2044 23430 4922 for<>the<>2028 12188 89427 unto<>the<>2020 8944 89427 out<>of<>1953 4357 48646 of<>Israel<>1701 48646 2578 and<>I<>1700 59758 23430 from<>the<>1674 5455 89427 that<>I<>1665 20250 23430 the<>king<>1662 89427 2695 said<>unto<>1643 6088 8944 on<>the<>1643 4949 89427 And<>he<>1637 13476 15266 with<>the<>1624 10543 89427 of<>his<>1594 48646 12329 I<>have<>1575 23430 6484 into<>the<>1565 3251 89427 to<>be<>1564 30166 10176 by<>the<>1520 4865 89427 and<>he<>1494 59758 15266 I<>had<>1445 23430 6896 that<>he<>1420 20250 15266 son<>of<>1418 2305 48646 children<>of<>1403 1936 48646 it<>was<>1400 12324 11852 the<>children<>1346 89427 1936 and<>they<>1337 59758 10570 the<>house<>1325 89427 2368 the<>son<>1322 89427 2305 the<>land<>1297 89427 1885 upon<>the<>1274 3952 89427 the<>people<>1266 89427 2470 I<>was<>1262 23430 11852 I<>am<>1255 23430 1430 that<>the<>1227 20250 89427 at<>the<>1178 4608 89427 of<>them<>1134 48646 9649 unto<>him<>1126 8944 9642 in<>a<>1120 20407 18937 him<>and<>1107 9642 59758 and<>his<>1101 59758 12329 and<>to<>1090 59758 30166 came<>to<>1080 3349 30166 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I have implemented "true" mutual information and Poisson's collocation measure for finding collocations in the corpus. Description of their implementations can be found in the source code comments. We will see the results of the association measures (AM) and their analysis in the following sections. (a) Top 50 Comparison - - - - - - - - - The two word collocations for True mutual information are found as follows: % statistic.pl tmi test2.tmi test2.cnt There are only 44 ranks in the TMI output. So we show top 50 trigrams as they are ordered in the output. --test2.tmi-- //TMI output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 the<>LORD<>1 0.0152 5964 89427 6653 of<>the<>2 0.0143 14731 48646 89427 shall<>be<>3 0.0075 2551 10449 10176 in<>the<>4 0.0074 6985 20407 89427 the<>the<>5 0.0067 2 89427 89427 I<>will<>6 0.0054 2044 23430 4922 said<>unto<>7 0.0052 1643 6088 8944 thou<>shalt<>8 0.0051 1024 5017 1627 I<>am<>9 0.0049 1255 23430 1430 of<>Israel<>10 0.0043 1701 48646 2578 out<>of<>11 0.0039 1953 4357 48646 children<>of<>12 0.0038 1403 1936 48646 son<>of<>13 0.0034 1418 2305 48646 thou<>hast<>14 0.0033 670 5017 1054 Van<>Helsing<>15 0.0031 305 305 305 I<>have<>15 0.0031 1575 23430 6484 the<>king<>16 0.0030 1662 89427 2695 and<>and<>17 0.0029 2 59758 59758 according<>to<>18 0.0027 734 799 30166 the<>children<>18 0.0027 1346 89427 1936 And<>he<>18 0.0027 1637 13476 15266 it<>was<>19 0.0026 1400 12324 11852 I<>had<>19 0.0026 1445 23430 6896 the<>land<>19 0.0026 1297 89427 1885 all<>the<>20 0.0025 2661 8710 89427 unto<>him<>20 0.0025 1126 8944 9642 had<>been<>21 0.0024 627 6896 1579 to<>pass<>21 0.0024 719 30166 906 into<>the<>22 0.0023 1565 3251 89427 of<>and<>23 0.0022 36 48646 59758 Lord<>GOD<>23 0.0022 290 1169 300 the<>son<>23 0.0022 1322 89427 2305 they<>were<>23 0.0022 921 10570 5349 came<>to<>23 0.0022 1080 3349 30166 the<>to<>23 0.0022 1 89427 30166 the<>house<>23 0.0022 1325 89427 2368 the<>earth<>24 0.0020 910 89427 1142 I<>could<>24 0.0020 782 23430 1969 he<>said<>24 0.0020 1022 15266 6088 unto<>them<>25 0.0019 968 8944 9649 could<>not<>25 0.0019 609 1969 10847 the<>people<>25 0.0019 1266 89427 2470 of<>of<>25 0.0019 9 48646 48646 to<>be<>25 0.0019 1564 30166 10176 Thus<>saith<>25 0.0019 305 514 1262 began<>to<>26 0.0018 542 736 30166 house<>of<>26 0.0018 978 2368 48646 saith<>the<>26 0.0018 891 1262 89427 a<>little<>26 0.0018 588 18937 1189 don<>t<>26 0.0018 219 219 667 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The two word collocations for Poisson's AM are found as follows: % statistic.pl user2 test2.u2 test2.cnt The 50 ranks are as follows: --test2.u2-- //Poisson's AM output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354222 the<>the<>1 5888.7006 2 89427 89427 and<>and<>2 2621.8905 2 59758 59758 the<>to<>3 1984.4361 1 89427 30166 of<>and<>4 1966.1513 36 48646 59758 of<>of<>5 1693.0572 9 48646 48646 the<>I<>6 1539.8723 1 89427 23430 I<>the<>7 1269.7665 68 23430 89427 a<>the<>8 1243.3868 1 18937 89427 to<>and<>9 1078.9410 63 30166 59758 to<>of<>10 1076.6269 1 30166 48646 of<>to<>11 1048.2659 6 48646 30166 the<>he<>12 973.1853 6 89427 15266 Project<>Gutenberg<>13 910.1206 113 163 113 he<>the<>14 904.4241 22 15266 89427 his<>the<>15 807.4520 1 12329 89427 that<>and<>16 803.8059 19 20250 59758 I<>and<>17 803.2322 61 23430 59758 I<>of<>18 798.3679 8 23430 48646 of<>I<>19 764.5523 16 48646 23430 years<>old<>20 761.5457 161 752 965 Holy<>Ghost<>21 739.8175 91 152 91 lifted<>up<>22 716.3992 155 196 3976 the<>not<>23 709.7152 1 89427 10847 on<>board<>24 696.6052 157 4949 192 meat<>offering<>25 681.2251 122 322 730 Madam<>Mina<>26 667.7936 86 92 205 thus<>saith<>27 666.3556 139 467 1262 to<>to<>28 654.2245 3 30166 30166 of<>in<>29 650.7003 18 48646 20407 they<>the<>30 643.4693 11 10570 89427 in<>of<>31 636.3871 22 20407 48646 o<>clock<>32 626.1584 75 95 97 sin<>offering<>33 614.3277 118 455 730 it<>the<>34 582.5441 67 12324 89427 place<>where<>35 582.2297 147 1317 1092 thy<>servant<>36 581.0534 165 4529 554 the<>all<>37 568.8163 1 89427 8710 high<>places<>38 563.4935 98 545 295 thine<>hand<>39 541.2238 145 923 1935 of<>he<>40 536.4604 2 48646 15266 rose<>up<>41 535.2729 123 205 3976 my<>lord<>42 531.6304 154 8795 286 at<>least<>43 525.9582 130 4608 254 pass<>when<>44 523.7818 167 906 4141 burnt<>offerings<>45 520.0814 86 388 271 Dr<>Seward<>46 513.3245 64 139 79 Mrs<>Harker<>47 511.0650 65 102 128 Mock<>Turtle<>48 509.2428 56 56 59 thou<>mayest<>49 497.1772 109 5017 117 their<>fathers<>50 489.2345 153 5767 561 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We find that the collocations in the Poisson's AM output are more interesting than that of the TMI output. ex. Holy Ghost, Mock Turtle, Madam Mina, thy servant, my lord, high places, etc. However, the Poisson's AM top ranks have collocations like "the the" and "and and". This is due to the high value of 'lamda' which is E11. (b) Cutoff Point - - - - - - - The cutoff for TMI output is where rank 38 starts (221st bigram). There are only 44 ranks in the output, out of which the first 221 bigrams have 38 ranks (ranks change faster), whereas the rest of the huge number of bigrams have just 6 distinct ranks distributed between them.(ranks change slower). i.e. There is a clustering of scores. pass<>when<>38 0.0006 167 906 4141 There is no particular cutoff point for the Poisson's AM's output, but we can call the 1st rank as a cutoff point essentially because it's score is unusually high. (c) Rank Comparison - - - - - - - - The rank comparison matrix for the measures is as shown below. TMI USER2 LL -0.9046 0.9773 MI -0.9055 0.6924 We observe that the output of TMI is significantly different (rather opposite) than LL and MI. The Poisson's ratio is similar to log-likelihood [Krenn 00]. Hence, the high rank coefficient is justified. Poisson's AM and MI are strongly correlated but not exactly same. We also observe that though the rank coefficient of TMI and LL is inversely related, the top 30 bigrams are almost the same. (d) Overall Recommendation - - - - - - - - - - - - I found out that the Pointwise mutual information gives very good results. Most collocations (in terms of natural language, not statistically proved) would probably occur once and would occur together. Such usage should be highly ranked. For example: menace<>Monster<>1 20.3690 1 1 1 blending<>contradictions<>1 20.3690 1 1 1 fleeting<>diorama<>1 20.3690 1 1 1 Hence, I would recommend Pointwise mutual information despite its drawbacks. My judgment is based on the results of my experiments. Experiment 2 ------------ The corpus used for the second experiment is the same as that used for the first experiment. However, in this experiment we find the trigrams (rather 3-word collocations) in the corpus. The data is preprocessed using the same token file to remove the non-alpha characters. Note: Before you run the following command you will probably need to change ----- limits for maximum file size and/or cpu time. This can be done by ulimit -f and ulimit -t respectively. Specify any high arbitrary value or make it unlimited. % count.pl --ngram 3 --token token.txt test3.cnt /home/cs/bhoi0001/CS8761/nsp/ The top 50 joint frequency rankings are shown below. --[test3.cnt]-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 of<>the<>LORD<>1626 48646 89427 6653 14731 1630 5964 the<>son<>of<>1300 89427 2305 48646 1322 25368 1418 the<>children<>of<>1263 89427 1936 48646 1346 25368 1403 out<>of<>the<>979 4357 48646 89427 1953 1290 14731 the<>house<>of<>901 89427 2368 48646 1325 25368 978 children<>of<>Israel<>650 1936 48646 2578 1403 650 1701 the<>land<>of<>618 89427 1885 48646 1297 25368 646 saith<>the<>LORD<>615 1262 89427 6653 891 615 5964 the<>sons<>of<>504 89427 1110 48646 517 25368 598 unto<>the<>LORD<>488 8944 89427 6653 2020 489 5964 the<>LORD<>and<>484 89427 6653 59758 5964 8114 520 it<>came<>to<>464 12324 3349 30166 501 816 1080 came<>to<>pass<>462 3349 30166 906 1080 462 719 and<>all<>the<>461 59758 8710 89427 1037 3875 2661 and<>I<>will<>456 59758 23430 4922 1700 633 2044 said<>unto<>him<>454 6088 8944 9642 1643 504 1126 And<>he<>said<>452 13476 15266 6088 1637 1202 1022 the<>king<>of<>427 89427 2695 48646 1662 25368 851 And<>the<>LORD<>408 13476 89427 6653 2249 409 5964 the<>hand<>of<>406 89427 1935 48646 453 25368 449 of<>the<>house<>393 48646 89427 2368 14731 463 1325 And<>it<>came<>384 13476 12324 3349 622 533 501 of<>the<>children<>364 48646 89427 1936 14731 401 1346 said<>unto<>them<>364 6088 8944 9649 1643 383 968 the<>midst<>of<>334 89427 384 48646 383 25368 334 the<>word<>of<>331 89427 952 48646 448 25368 375 the<>name<>of<>329 89427 1105 48646 350 25368 340 it<>shall<>be<>326 12324 10449 10176 614 667 2551 in<>the<>land<>325 20407 89427 1885 6985 381 1297 according<>to<>the<>323 799 30166 89427 734 341 3830 and<>they<>shall<>320 59758 10570 10449 1337 1217 894 of<>the<>land<>319 48646 89427 1885 14731 376 1297 and<>said<>unto<>314 59758 6088 8944 1043 641 1643 of<>the<>earth<>308 48646 89427 1142 14731 314 910 LORD<>thy<>God<>296 6653 4529 4614 304 675 351 the<>LORD<>thy<>296 89427 6653 4529 5964 319 304 Thus<>saith<>the<>291 514 1262 89427 305 313 891 the<>king<>s<>288 89427 2695 3699 1662 977 300 all<>the<>people<>287 8710 89427 2470 2661 341 1266 house<>of<>the<>287 2368 48646 89427 978 436 14731 and<>in<>the<>286 59758 20407 89427 839 3875 6985 he<>said<>unto<>283 15266 6088 8944 1022 466 1643 the<>LORD<>And<>277 89427 6653 13476 5964 1943 281 one<>of<>the<>270 3676 48646 89427 651 394 14731 word<>of<>the<>268 952 48646 89427 375 310 14731 in<>the<>midst<>267 20407 89427 384 6985 268 383 before<>the<>LORD<>265 2676 89427 6653 742 265 5964 the<>Lord<>GOD<>262 89427 1169 300 731 264 290 LORD<>of<>hosts<>244 6653 48646 300 263 244 285 the<>LORD<>of<>240 89427 6653 48646 5964 25368 263 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We cannot say by looking at the results of the frequency rankings that the collocations found are not significant. I have implemented the log-likelihood AM and Poisson's AM for trigrams. The description of their implementations can be found in their respective source codes. In the subsequent sections, we note the results of these two AMs on the corpus and compare and analyze the findings. (a) Top 50 Comparison - - - - - - - - - The 3-word collocations are found using log likelihood for trigrams as follows: % statistic.pl --ngram 3 ll3 test3.ll3 test3.cnt The top 50 3-word collocations found using log likelihood AM are as shown: --[test3.ll3]-- // Log likelihood output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 the<>LORD<>of<>1 111362.6616 240 89427 6653 48646 5964 25368 263 the<>children<>of<>2 88547.8405 1263 89427 1936 48646 1346 25368 1403 the<>king<>of<>3 88004.4561 427 89427 2695 48646 1662 25368 851 the<>son<>of<>4 87925.7509 1300 89427 2305 48646 1322 25368 1418 the<>house<>of<>5 85487.1624 901 89427 2368 48646 1325 25368 978 the<>land<>of<>6 85410.9035 618 89427 1885 48646 1297 25368 646 the<>earth<>of<>7 84784.2917 3 89427 1142 48646 910 25368 3 the<>people<>of<>8 84239.8952 139 89427 2470 48646 1266 25368 175 the<>same<>of<>9 83991.7332 1 89427 723 48646 662 25368 1 the<>Lord<>of<>10 83229.8499 15 89427 1169 48646 731 25368 25 the<>sons<>of<>11 83141.9206 504 89427 1110 48646 517 25368 598 the<>midst<>of<>12 83020.0501 334 89427 384 48646 383 25368 334 the<>world<>of<>13 82577.2817 7 89427 595 48646 468 25368 16 the<>one<>of<>14 82412.0136 4 89427 3676 48646 141 25368 651 the<>city<>of<>15 82330.9315 115 89427 977 48646 575 25368 158 the<>part<>of<>16 82258.6409 16 89427 574 48646 20 25368 341 the<>sea<>of<>17 82238.1683 20 89427 723 48646 474 25368 28 the<>priest<>of<>18 82170.8559 9 89427 576 48646 414 25368 18 the<>other<>of<>19 82114.2863 5 89427 1397 48646 587 25368 20 the<>first<>of<>20 82101.3345 14 89427 1185 48646 555 25368 29 +the<>word<>of<>21 82025.5217 331 89427 952 48646 448 25368 375 the<>congregation<>of<>22 81959.0340 73 89427 364 48646 328 25368 82 the<>ship<>of<>23 81938.7633 3 89427 617 48646 390 25368 9 the<>tabernacle<>of<>24 81811.1994 174 89427 327 48646 286 25368 175 the<>door<>of<>25 81774.1339 108 89427 538 48646 375 25368 116 the<>hand<>of<>26 81690.8952 406 89427 1935 48646 453 25368 449 the<>end<>of<>27 81681.8431 147 89427 545 48646 260 25368 260 the<>name<>of<>28 81680.6279 329 89427 1105 48646 350 25368 340 the<>ground<>of<>29 81672.5320 2 89427 442 48646 304 25368 5 the<>Levites<>of<>30 81664.8878 3 89427 265 48646 242 25368 3 the<>wilderness<>of<>31 81662.8149 53 89427 314 48646 277 25368 54 the<>tribe<>of<>32 81648.1790 176 89427 242 48646 179 25368 207 the<>ark<>of<>33 81629.7452 139 89427 231 48646 219 25368 147 the<>inhabitants<>of<>34 81537.1688 157 89427 228 48646 187 25368 178 the<>Son<>of<>35 81468.1143 127 89427 293 48646 162 25368 203 the<>altar<>of<>36 81467.6858 58 89427 379 48646 277 25368 67 the<>way<>of<>37 81452.9843 169 89427 1463 48646 511 25368 229 the<>priests<>of<>38 81433.5951 11 89427 420 48646 272 25368 17 the<>whole<>of<>39 81424.3040 12 89427 478 48646 287 25368 16 the<>law<>of<>40 81419.8516 97 89427 588 48646 333 25368 111 the<>Jews<>of<>41 81417.7290 3 89427 253 48646 212 25368 3 the<>sword<>of<>42 81402.8199 24 89427 471 48646 288 25368 31 the<>morning<>of<>43 81366.9030 1 89427 505 48646 274 25368 2 the<>chief<>of<>44 81365.9421 70 89427 319 48646 236 25368 90 the<>field<>of<>45 81361.1708 27 89427 332 48646 243 25368 32 the<>top<>of<>46 81353.0009 132 89427 177 48646 163 25368 133 the<>wicked<>of<>47 81351.9657 5 89427 395 48646 249 25368 5 the<>rest<>of<>48 81347.0400 157 89427 587 48646 310 25368 164 the<>island<>of<>49 81314.2605 7 89427 301 48646 219 25368 10 the<>sight<>of<>50 81294.1331 187 89427 504 48646 207 25368 226 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 3-word collocations are found using Poisson's AM as follows: % statistic.pl --ngram 3 user3 test3.u3 test3.cnt The top 50 3-word collocations found using Poisson's AM are as shown: --test.u3-- //user3 measure output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1354215 of<>the<>LORD<>1 2955.2691 1626 48646 89427 6653 14731 1630 5964 the<>children<>of<>2 2915.5269 1263 89427 1936 48646 1346 25368 1403 the<>son<>of<>3 2906.3098 1300 89427 2305 48646 1322 25368 1418 children<>of<>Israel<>4 2437.1633 650 1936 48646 2578 1403 650 1701 saith<>the<>LORD<>5 1941.7497 615 1262 89427 6653 891 615 5964 came<>to<>pass<>6 1878.7467 462 3349 30166 906 1080 462 719 the<>house<>of<>7 1836.9888 901 89427 2368 48646 1325 25368 978 out<>of<>the<>8 1738.1924 979 4357 48646 89427 1953 1290 14731 said<>unto<>him<>9 1445.7324 454 6088 8944 9642 1643 504 1126 it<>came<>to<>10 1282.3006 464 12324 3349 30166 501 816 1080 And<>he<>said<>11 1241.8753 452 13476 15266 6088 1637 1202 1022 the<>land<>of<>12 1213.9892 618 89427 1885 48646 1297 25368 646 Thus<>saith<>the<>13 1182.4689 291 514 1262 89427 305 313 891 And<>it<>came<>14 1179.5944 384 13476 12324 3349 622 533 501 the<>Lord<>GOD<>15 1131.4398 262 89427 1169 300 731 264 290 said<>unto<>them<>16 1118.7897 364 6088 8944 9649 1643 383 968 LORD<>thy<>God<>17 1075.9444 296 6653 4529 4614 304 675 351 the<>sons<>of<>18 1072.1146 504 89427 1110 48646 517 25368 598 unto<>the<>LORD<>19 1006.5100 488 8944 89427 6653 2020 489 5964 LORD<>of<>hosts<>20 907.1557 244 6653 48646 300 263 244 285 land<>of<>Egypt<>21 897.9751 227 1885 48646 612 646 227 422 +and<>I<>will<>22 866.0937 456 59758 23430 4922 1700 633 2044 saith<>the<>Lord<>23 849.4530 239 1262 89427 1169 891 241 731 it<>shall<>be<>24 835.0506 326 12324 10449 10176 614 667 2551 the<>midst<>of<>25 819.0449 334 89427 384 48646 383 25368 334 the<>king<>s<>26 775.3358 288 89427 2695 3699 1662 977 300 he<>said<>unto<>27 769.2952 283 15266 6088 8944 1022 466 1643 And<>thou<>shalt<>28 760.0523 212 13476 5017 1627 244 215 1024 according<>to<>the<>29 745.5014 323 799 30166 89427 734 341 3830 in<>the<>midst<>30 740.8262 267 20407 89427 384 6985 268 383 And<>the<>LORD<>31 721.3574 408 13476 89427 6653 2249 409 5964 the<>hand<>of<>32 706.9379 406 89427 1935 48646 453 25368 449 the<>king<>of<>33 683.5401 427 89427 2695 48646 1662 25368 851 in<>the<>land<>34 675.1543 325 20407 89427 1885 6985 381 1297 all<>the<>people<>35 661.7601 287 8710 89427 2470 2661 341 1266 the<>word<>of<>36 659.9338 331 89427 952 48646 448 25368 375 I<>pray<>thee<>37 657.1515 162 23430 356 3926 224 409 172 I<>could<>not<>38 656.3701 229 23430 1969 10847 782 1434 609 and<>said<>unto<>39 655.6299 314 59758 6088 8944 1043 641 1643 of<>the<>house<>40 638.2263 393 48646 89427 2368 14731 463 1325 the<>LORD<>thy<>41 637.2239 296 89427 6653 4529 5964 319 304 the<>name<>of<>42 630.4330 329 89427 1105 48646 350 25368 340 before<>the<>LORD<>43 625.5478 265 2676 89427 6653 742 265 5964 of<>the<>children<>44 613.8381 364 48646 89427 1936 14731 401 1346 I<>don<>t<>45 604.2793 120 23430 219 667 120 256 219 come<>to<>pass<>46 601.0908 164 2653 30166 906 523 164 719 and<>thou<>shalt<>47 582.1768 206 59758 5017 1627 344 225 1024 to<>pass<>when<>48 576.4219 167 30166 906 4141 719 227 167 of<>the<>earth<>49 574.9629 308 48646 89427 1142 14731 314 910 answered<>and<>said<>50 568.6228 180 602 59758 6088 190 180 1043 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We can see that "the _ of" trigrams topping the list of the log likelihood AM output. This trend continues for rank 1777 (trigram #2810).This is due to the high frequency of 'the' and 'of'. Since we only consider the ratio of joint frequency and expected value for it if the 3 words were independent, this trend does not necessarily reflect in the Poisson's AM output. We observe that Poisson's AM has found better 3-word collocations. ex. land of Egypt, children of Egypt, lord of hosts. We also observe that the top 50 frequency rankings and Poisson's AM rankings are somewhat similar. (b) Cutoff Point - - - - - - - There does exist a cutoff point in the LL3 output as pointed out earlier. The cutoff exists at rank 1777, before which the trigrams are of kind "the _ of" and after which the other trigrams occur. As said earlier, this is due to the high frequency of 'the' and 'of'. We could say that the LL3 is influenced by the individual frequencies of the words in the trigrams. As for the output of Poisson's AM, I cannot find any interesting patterns. We do not claim that there are no patterns in the output, but at least they are not readily discernible to the eye unaided by statistics or mining tools. I looked for clustering of ranks and word-type trigrams clustering, but could not find anything interesting. We could say that the individual frequencies (like high frequencies of 'the' and 'of') do not affect Poisson's output as much as they do affect LL3 output. (c) Rank Comparison - - - - - - - - The rank comparison of LL3 and Poisson's AM for trigrams was done as follows: % rank.pl test3.ll3 test3.u3 Rank correlation coefficient = -0.1813 The negative value suggests a reverse relation, however since the value is closer to zero, we can say that they are more of less unrelated. The log likelihood as we have seen in part (c) is more similar to the Poisson's test, but our results do not confirm this. This would probably be because we only consider joint frequency of the trigram in Poisson's test. But, the results are better than log-likelihood, at least in this case (and the several other I tried). (d) Overall Recommendation - - - - - - - - - - - - From the results of my experiments, I would say that the 3-word collocations found by the Poisson's AM are more significant than those found by log-likelihood AM. However, we do not have enough data or theory to support this statement. Our conclusions regarding the best measure for finding collocations are further limited by the fact that we have considered only two of the innumerous AMs.++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Natural Language Processing CS8761 Bridget Thomson McInnes 10 October 2002 EXPERIMENT 1: -------------------------------------------------------------------------------------------------------------------- This experiment required the implementation of two modules to be used in conjunction with the N-gram Statistic Package. The two modules that were implemented calculate the True Mutual Information and the Symmetric Lambda measure. The modules were named tmi.pm and user2.pm respectively. True Mutual Information calculates how much one random variable tell about another. The formula that was used is: I(X;Y) = Sum( P(x,y) * lg( P(x,y) / P(x)P(y) ) ) = H(X) - H(X,Y). This states that "the information about X tells about Y is the uncertainty in X plus the uncertainty about Y minus the uncertainty in both X and Y" [1]. This formula can be re-written to correspond to the information of a contingency table. Given a contingency table: word2 ! !word2 ______________________ | | | word1 | n11 | n12 | n1p | | | ---------------------- | | | !word1 | n21 | n22 | n2p | | | ---------------------- -------- np1 np2 | npp The formula is: I(X;Y) = Sum( nij/n++ * lg( nij/eij ) ) Where nij is the known frequency of a specific cell and eij is the expected value for the corresponding cell. The calculation of True Mutual Information for the contingency table will give the information word1 tells about word2's uncertainty in word1 plus the uncertainty about word2 minus the uncertainty in both word1 and word2. Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage reduction in errors in predicting the dependent given knowledge of the independent" [3]. The formula used to calculate lambda is: lambda = ( rF + cF - Fr - Fr ) / ( (2 * N) - Fr - Fc ), where rF = Sum of the maximum frequency in each row cF = Sum of the maximum frequency in each column Fr = maximum marginal row value Fc = maximum marginal column value N = total bigram count Lambda is "the percent one reduces errors in guessing the value of the dependent variable when one knows the value of the independent variable. Specifically, lambda is the surplus of errors made when the marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent variable are known for each value of the independent variable" [3]. "A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own frequency distribution" [3]. Therefore a lambda of 1 indicates that knowing the distribution of the independent variable helps estimate the value of the dependent variable. PLEASE NOTE: the experiment was run using the complete works of Shakespeare found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare TOP 50 COMPARISONS: -------------------------------------------------------------------------------------------------------------------- Which measure seems better at identifing significant or interesting collocations? The tmi module and the user2 module, which I will call the lambda module from here on out, returned extremely different results when comparing their bigrams. There is not a single overlap when comparing the top 50 bigrams of each modules output. Meaning combining the top 50 bigrams of each module there are 100 unique bigrams. The tmi module returned quite a few Proper Noun collocations, for example: KING<>RICHARD<> MARK<>ANTONY<> HENRY<>VI<> PRINCE<>HENRY<> DOMITIUS<>ENOBARBUS<> QUEEN<>MARGARET<> DON<>PEDRO<> I found this to be very good. I believe those are collocations that are desired. The lambda module did not return any Proper Noun collocations. The tmi module's output returned very few formal collocations, for example: to<>be<> my<>lord<> These are good collocations because they represent the type of text that was used. The text that was used for this experiment was the complete works of Shakespeare. The words "my lord" and "to be" tend to occur often together throughout Shakespeare's writings. But there were only a few of that type of collocation. The tmi module returned 29 out of the top 50 bigrams containing function words or punctuation while the lambda module return zero. These are bigrams with an expected high frequency. Although, function word collocations are not 'interesting' collocations, it is interesting to note that the lambda module did not return any. Lambda indicates the percentage of error reduction in guessing the value of the dependent variable when knowing the independent variable. This is because when a collocations has a lower frequency value the lambda will move closer to one while when a bigram has a higher frequency value the lambda moves closer to zero. Because of this, the lambda module's output did not return any interesting bigrams in the top 50 ranks. In the last 50 bigrams of the lambda module, quite a few interesting collocations appeared. For example: rich<>hangings<> blessed<>wood<> sweet<>Exeter<> rude<>hands<> secretly<>open<> desert<>country<> widow<>sister<> These collocations were very interesting. Although no Proper Name collocations appeared in the first 50 bigrams of the lambda module's output, there were quite a few 'true' collocations that appeared. More true collocations appeared in the last 50 bigrams of the lambda output than appeared in the first 50 bigrams of the tmi module's output. How would you characterized the top 50 bigrams found by each module? I would characterize the tmi module's top 50 bigrams as consisting mostly of Proper Noun collocations and function word collocations. It was very good at returning Proper Noun collocations which is interesting. I would characterize the lambda module's top 50 bigrams as very poor but I would characterized the last 50 bigrams as consisting mostly of interesting bigrams - 'true' bigrams. Bigrams that would be considered significant. Is one of these measures significantly 'better' or 'worse' than the other? Why? Determining whether the True Mutual Information measure is better or worse than the Lambda Symmetric measure is difficult. The True Mutual Information measure returned many Proper Noun collocations which are considered very significant collocations. Proper Noun collocations are collocations that a measure to distinguish. But the Lambda Symmetric measure return collocations that where much more interesting, for example "desert country". This describes something in the about the text that is being analyzed. Because of this I would have to say that the Lambda Symmetric measure performs 'better' than the True Mutual Information measure. This opinion is based on the comparing the top 50 bigrams in the tmi module's output to the last 50 bigrams in the Lambda Symmetric module. CUTOFF POINT: -------------------------------------------------------------------------------------------------------------------- Are there 'natural' cutoff points for scores, where bigrams above this value appear to be interesting or significant while those below do not? The lambda module's output values consist mostly of one and zero. At line 807, in the lambda output file, the value becomes less than one. At line 23185, in the lambda output file, the value goes to zero. This means that 22378 out of 357114 bigrams have a value that is not equal to one or zero. The group of bigrams that have a score of zero have a lot of interesting bigrams, for example: weary<>bones<> country<>wench<> While the group of bigrams that have a score of one on a whole do not have interesting bigrams, for example: UNIMPROVED<>unreproved<> noster<>Henricus<> Although, while getting examples of uninteresting bigrams with a score of one I found: helter<>skelter<>1 1.0000 2 2 2 Which, of course, is very interesting. I believe this shows that, significant bigrams are not confined to bigrams with a lambda score of zero but significant bigrams seem to occur more often when the score is one. As the lambda score grows closer to one the number of significant bigrams seems to decrease and likewise, as the lambda score moves closer to zero the number of significant bigrams increase. The highest score for True Mutual Informations is .0083 while the most common score greater than zero is .0001. Significant collocations seems to be dispersed throughout the table. For example: cunning<>hounds<>41 0.0000 2 155 66 chamber<>window<>40 0.0001 24 287 92 But Proper Name collocations have a rank of .0005 and higher which is a significant cutoff. The majority of the scores fall below .0002. From line 1345 and on, the scores have a value of .0001 or lower. This means that 355799 out of 357144 bigrams are .0001 or less. Any collocations with a score of .0001 or lower is going to be lost. RANK COMPARISON: -------------------------------------------------------------------------------------------------------------------- Comparison between ll.pm and tmi.pm. %csdev0540% perl rank.pl ll.txt tmi.txt %Rank correlation coefficient = -0.9268 Since the Rank correlation coefficient is close to -1 we can say that Loglikelihood Ratio and True Mutual Information are almost exactly opposite of each other Comparison between ll.pm and user2.pm. %csdev0542% perl rank.pl ll.txt user2.txt %Rank correlation coefficient = -0.6381 Again the Rank correlation coefficient is closer to -1 than 0. Therefore, we can say that the Loglikelihood Ratio and the Symmetric Lambda measure are opposite of each other. We should notice though that the Symmetric Lambda measure is closer to the Loglikelihood Ratio than True Mutual Information is. Comparison between mi.pm and tmi.pm. %csdev0543% perl rank.pl mi.txt tmi.txt %Rank correlation coefficient = -0.9273 The Rank correlation coefficient is close to -1. Therefore, we can say that Pointwise Mutual Information and True Mutual Information are opposite of each other. They are 'more opposite' of each other than LogLikelihood and True Mutual Information. Comparison between mi.pm and user2.pm. %csdev0544% perl rank.pl mi.txt user2.txt %Rank correlation coefficient = -0.6364 The Rank correlation coefficient is closer to -1 than 0. Therefore, we can say that the Loglikelihood Ratio and the Symmetric Lambda measure are opposite of each other. We should notice though that the Symmetric Lambda measure is closer to Pointwise Mutual I Information than True Mutual Information is. OVERALL RECOMMENDATION: -------------------------------------------------------------------------------------------------------------------- Which is the best measure for identifying significant collocations in large corpora: mi.pm, ll.pm, user2.pm or tmi.pm? I think that the best measure for identifying significant collocations in large corpora is the ll.pm module. The ll.pm module seems to rank interesting collocations higher. For example, looking at a few randomly selected collocations and their line numbers. Proper Noun Collocations QUEEN<>GERTRUDE<> ll.pl - 162 mi.pl - 23638 tmi.pl - 147 user2.pl - 2147 KING<>EDWARD<> ll.pl - 147 mi.pl - 50771 tmi.pl - 174 user2.pl - 5233 MARK<>ANTONY<> ll.pl - 16 mi.pl - 19732 tmi.pl - 18 user2.pl - 826 Significant Collocations your<>highness<> ll.pl - 183 mi.pl - 69966 tmi.pl - 204 user2.pl - 5943 pray<>you<> ll.pl - 86 mi.pl - 119610 tmi.pl - 90 user2.pl - 290548 Sir<>John<> ll.pl - 89 mi.pl - 34664 tmi.pl - 94 user2.pl - 5162 my<>heart<> ll.pl - 114 mi.pl - 143689 tmi.pl - 111 user2.pl - 32081 We can see from the above examples that the ll.pm module gives interesting collocations a higher value. I chose to look at the line number to justify why I picked Loglikelihood because there are the same number of bigrams in each file and the line number gives us an idea of where they fall on each of the different measures ranking system. It must be noted that the user2.pl module ranks better collocations closer to the bottom than the top. Yet, even keeping this in mind the ll.pm module identifies the significant collocations better. EXPERIMENT 2: -------------------------------------------------------------------------------------------------------------------- This experiment required the implementation of two modules to be used in conjunction with the N-gram Statistic Package using trigrams. The two modules that were implemented calculate the Loglikelihood Coefficient and the Symmetric Lambda measure. The modules were named ll3.pm and user3.pm respectively. Log Likelihood coefficient determines how much more likely one hypothesis is from another. The formula that is used is: G^2 = Sum( (2 * nij) * log( nij / eij) ) This states how much more likely the known values are from the expected values of the trigram. The formula above corresponds to the following contingency table: | w3 | !w3 | ____|__________|__________| | | | | | w2| n11 | n12 | n1p w1 |___|__________|__________|___ | | | | |!w2| n21 | n22 | n2p ____|___|__________|__________|___ | | | | | w2| n31 | n32 | n3p !w1 |___|__________|__________|___ | | | | |!w2| n41 | n42 | n4p |___|__________|__________| _____ np1 | np2 | npp Where nij is the known frequency of a specific cell and eij is the expected value for the corresponding cell. The calculation of the Loglikelihood Coefficient for the contingency table will give the known probability of the words compared to the expected probability of the words. Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage reduction in errors in predicting the dependent given knowledge of the independent" [3]. The formula used to calculate lambda is: lambda = ( rF + cF - Fr - Fr ) / ( (2 * N) - Fr - Fc ), where rF = Sum of the maximum frequency in each row cF = Sum of the maximum frequency in each column Fr = maximum marginal row value Fc = maximum marginal column value N = total trigram count Lambda is "the percent one reduces errors in guessing the value of the dependent variable when one knows the value of the independent variable. Specifically, lambda is the surplus of errors made when the marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent variable are known for each value of the independent variable" [3]. "A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own frequency distribution" [3]. Therefore a lambda of 1 indicates that knowing the distribution of the independent variable helps estimate the value of the dependent variable. PLEASE NOTE: the experiment was run using the complete works of Shakespeare found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare TOP 50 COMPARISONS: -------------------------------------------------------------------------------------------------------------------- Which measure seems better at identifying significant or interesting collocations? The ll3.pm module and the user3.pm module, which I will call the loglikelihood module and the lambda module respectively from here on out, returned extremely different results when comparing them. There is not a single overlap when comparing the top 50 trigrams of each modules output. Meaning combining the top 50 trigrams of each module there are 100 unique trigrams. The loglikelihood module returned trigrams that are similar to the below examples: the<>Duke<>of<> 2 20253.8357 178 51427 502 33153 188 7915 424 the<>name<>of<> 3 19756.8674 121 51427 1383 33153 153 7915 176 the<>Earl<>of<> 4 19304.0744 46 51427 166 33153 46 7915 156 the<>house<>of<>5 19223.6022 72 51427 1159 33153 202 7915 92 49 out of the top 50 trigrams consist of similar trigrams. The only trigram of the 50 that did not look similar to the examples above was: ",<>my<>lord<>1 26939.4642 1938 181185 22593 4425 4400 2068 2532". This trigram was not very interesting either. Lambda indicates the percentage of error reduction in guessing the value of the dependent variable when knowing the independent variable. This is because when a collocations has a lower frequency value the lambda will move closer to one while when a collocation has a higher frequency value the lambda moves closer to zero. Because of this, the lambda module's output did not return any interesting collocations in the top 50 ranks. In the last 50 bigrams of the lambda module, quite a few interesting collocations appeared for bigrams. I expected this to be true for trigrams also but it was not. The last 50 trigrams contained punctuation. For example: FORMAL<>regular<>,<> Throca<>movousus<>,<> Although even with punctuation, there were interesting trigrams. For example: rigorously<>effused<>,<> Perfect<>chrysolite<>,<> affectedly<>Enswathed<>,<> But as seen, the trigrams end with a punctuation. How would you characterized the top 50 trigrams found by each module? I would characterize the loglikelihood module's top 50 trigrams as consisting mostly of uninteresting trigrams. The trigrams all consist of "the of". I would characterize the lambda module's top 50 trigrams as very poor also. Is one of these measures significantly 'better' or 'worse' than the other? Why? I don't think one of the models is better or worse than the other. Neither return outstanding collocations. Even taking into consideration that the lambda module ranks better collocations closer to the bottom than the top, the collocations are not note worthy. CUTOFF POINT: -------------------------------------------------------------------------------------------------------------------- Are there 'natural' cutoff points for scores, where trigrams above this value appear to be interesting or significant while those below do not? The loglikelihood module's trigrams that consist of "the of" end at rank 586. They then continue on with "I not" trigrams, then move to "The of" trigrams. Not very interesting. The trigrams following follow a similar pattern. The trigrams come in blocks. Each block looking similar. For example, the "I you" block and the ", ." block. Some trigrams seem to look like small sentences more than collocations. For example, "YOU<>LIKE<>IT<>". Other trigrams look like prepositional phrases. For example, "more<>mirth<>than<>" and "more<>faithful<>than<>" which may be interesting but I don't believe that they are collocations. This may have to do with the data that I am using. Most sentences in Shakespeare's works are short. The lambda module's trigrams do not have a natural cut off either. The top 100 trigrams were not very good collocations: For example: ,<>these<>encounterers<> mischievous<>UNHATCHED<>undisclosed<> The last 100 trigrams were not very good either which is not what I expected. They consisted of punctuation. For example: .<>CXLV<>.<> ,<>Tombless<>, Not very useful nor interesting. One other point to make about the lambda module. For trigrams, I came up with negative values. This measure is suppose to be available to use for trigram data. I am not certain why the negative values showed. There does not seem to be anything wrong with the implementation. I did not get negative values until I ran my program with a very large corpora. The data worked perfectly on smaller data sets. The results stayed between 1 and -1. The trigrams that recieved a score of -1 were similar to the trigrams that recieved a score of 1. Both sets contained trigrams with punctuation. I did not read in any of the literature that the absolute value of the score must be taken but this is one possiblity that I looked into. OVERALL RECOMMENDATION: -------------------------------------------------------------------------------------------------------------------- Which is the best measure for identifying significant collocations in large corpora: ll3.pm or user3.pm? In my opinion, ll3.pm is the best measure to use. Neither measure identified interesting or significant collocations. This may have to do with the data that I was experimenting with but I did not see any grouping of good collocations. They seemed to be dispersed throughout the tables. I believe that ll3.pm is the better measure due to the negative numbers I was getting for the user3.pm module. SOURCES: ------------------------------------------------------------------------------------------------------------------- [1] www.engineering.usu.edu/classes/ece/7680/lecture2/node3.html [2] www.zi.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap28/sect20.htm [3] ww2.chass.ncsu.edu/garson/pa765/assocnominal.htm ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Suchitra Goopy 10/11/2002 Natural Language Processing ------------------------------------------------------------------------------ Assignment 2 The Great Collocation Hunt ------------------------------------------------------------------------------ INTRODUCTION: ------------ "A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things" -Foundations of Statistical Natural Language Processing (Christopher D.Manning and Hinrich Schutze) This experiment was performed with the aim of finding a test of association that can be implemented both for bigrams and trigrams.The nsp package was implemented by "Satanjeev Banerjee" and has many interesting measures of association for bigrams.Though nsp has been extended to accomodate ngrams, a suitable measure of association(for nograms) is yet to be implemented. The test of association that I have used in this assignment is the "Goodman and Kruskal's Gamma".This test was taken from "Handbook of Parametric and NonParametric Statistical Procedures" by David J.Sheskin.A test of association is usually performed to see if the variables involved in the test are dependent on each other,or are independent.Many tests can also be used to determine how strong or how weak the association between the two variables is. Though I learnt that the "Goodman and Kruskal's Gamma" can be extended to accomodate trigrams,I have not found any supporting documents or references for this.But I have extended it to perform for trigrams as well. "The Goodman and Kruskal's Gamma is a signed test that measures both the strength and direction of the relationship,the sign indicating whether a relationship is positive or negative" - Taken from a website,the url is given below http://216.239.53.100/search?q=cache:VBkDxI8tR70C:www.bus.ed.ac.uk/cfm/cfmr988.pdf+extending+goodman+and+kruskal%27s+gamma++for+3+variables&hl=en&ie=UTF-8 EXPERIMENT 1: ------------- The two main modules that had to be implemented for performing Experiment 1 were tmi.pm and user2.pm.The module "tmi.pm" was implemented to calculate "True Mutual Information",while user2.pm was implemented for bigrams using the Goodman and Kruskal's Gamma. TOP 50 COMPARISONS ------------------ First a comparison was made between the outputs of tmi.pm and user2.pm I found that user2 had more interesting collocations than tmi.Though there were a number of collocations in the top 50 of user2 that did not make much sense,a few of the interesting collocations I found were : picked<>up<>1 1.0000 1 1 385 stark<>naked<>1 1.0000 2 2 15 complied<>with<>1 1.0000 1 1 1071 split<>open<>5 0.9996 1 2 32 laboured<>hard<>15 0.9986 1 3 51 banished<>from<>32 0.9969 2 3 443 The above words usually tend to occur together and were also highly ranked by user2.If the value is closer to 1 then it indicates a very strong positive relationship.As we went down the ranks some of the following collocations I found were: shore<>would<>6397 0.0395 1 269 482 water<>a<>6842 -0.0971 2 148 2296 cannibal<>coast<>19 0.9982 1 4 43 enough<>in<>6940 -0.1348 1 98 1871 These were not good collocations and were ranked low by user2.Also some of them have a negative sign that indicates that the words have a negative relationship with each other. Tmi on the other hand had ranked the following to be very high.Though "of the" tends to occur together very often, or rather most of the time,it does not really form a collocation,we do not have enough evidence to say it forms a collocation.(It does not make much sense by itself) of<>the<>4 0.0086 817 3558 5854 Some of the good collocations in tmi were ranked down the list.Like the ones given below: cried<>out<>48 0.0007 13 18 423 cast<>away<>49 0.0006 12 32 147 rainy<>season<>50 0.0005 7 14 31 I feel that "dice.pm" would have performed better than the above two tests. This is because it has more information when it ranks the bigrams and hence does a better job. CUTOFF POINT ----------- On looking at the output for tmi I did not find any cut-off point above which the collocations were interesting.There were some vague and funny collocations even in the first few ranks.Sometimes I found some interesting collocations towards the middle,say around the 50th or 60th ranks. The output for user2 had good collocations in the first few ranks.Though I looked for a cut-off point I was not able to really find one.It was true that as the ranks decreased,the interesting collocations decreased as well, but just as I thought that I had found a cut-off point I would see a few good collocations,and would have to rethink the cut-off point. RANK COMPARISON --------------- I used rank.pl to compare the output produced by ll.pm and tmi.pm rank.pl ll.out tmi.out I got a rank correlation coefficient of -0.0338 This indicates that the two rankings are reversed. I compared ll.pm and user2. rank.pl ll.out user2.out 0.7117 This shows that they are ranked in almost the same way. rank.pl mi.out tmi.out -0.1136 A comparison between mi and tmi shows that they are reverse ranked. mi.out user2.out 0.9229 This shows that mi and user2 are ranked in almost the same way. Each measure is different in that they extract or rather,capture different elements.The tests of association are varied and differ in what they consider, to perform their calculatations or carry out the test.For example,if a measure uses frequency values in it's test then prepositions will occur more frequently and hence will be ranked higher,but that does not mean that they form collocations. OVERALL RECOMMENDATION ---------------------- On comparing the 4 modules I found that mi.pm had very good results.Many collocations were interesting,like the ones below pro<>cessing<>1 17.0977 1 1 1 tax<>deductible<>2 16.0977 1 2 1 United<>States<>2 16.0977 2 2 2 After mi.pm I would think that user2.pm performed very well.This is because some good collocations were ranked very high,and as the rank decreased most of the bigrams did not make much sense. Then tmi.pm performed well compared to ll.pm.It seemed to give higher ranks to better collocations.I came to this conclusion beacuse ll.pm did not seem to do the ranking as effective as tmi.pm.For instance,here is an example where the same bigram was ranked differently.I noticed this trend for a number of them ll.pm : cast<>away<>335 123.3642 12 32 147 ll tmi.pm :cast<>away<>49 0.0006 12 32 147 ---------------------------------------------------------------------------- EXPERIMENT 2: ------------ The two main modules that had to be implemented for experiment 2 were ll3.pm and user3.pm.ll3.pm was the Log Likelihood Ratio for trigrams. user3.pm was the same test of association used for bigrams,but had to be extended for trigrams. TOP 50 COMPARISONS ------------------ When I compared the output produced by ll3.pm and user3.pm,I noticed that most of the trigrams(atleast in ll3) had atleast one or two punctuation marks like , or ; .Then I tried to eliminate the formation of these trigrams by using the --stop*** option provided in nsp.But this option allows you to remove the trigram only if it is made up entirely of all the stop words in the STOP file.This did pose some problems when trying to find and analyse the interesting trigrams.Probably,if I had been able to eliminate the punctuation in the trigrams,more meaningful collocations could have been formed. user3.pm provided more interesting trigrams than ll3.pm.As said above ll3.pm had a number of trigrams with punctuation marks in it in the first 100's of ranks.The following were some of the interesting trigrams that I noticed in the output produced by user3.pm services<>for<>which<>1 1.0000 1 1 1320 889 1 1 8 Thank<>God<>!<>1 1.0000 1 1 159 92 1 1 3 uneven<>state<>of<>1 1.0000 1 1 22 3558 1 1 13 circumstance<>presented<>itself<>13 0.9988 1 3 12 31 1 1 2 Please<>note<>:<>50 0.9951 1 2 6 337 1 1 1 ll3.pm on the other hand had a lot of punctuation marks in the trigrams, and most of the times it wasnt a good collocation.Though some words tend to occur together as they did,they do not necessarily form collocations. CUTOFF POINT ----------- I did not find any interesting cut-off point above which there were unusual or good collocations.There were a number of good collocations in the beginning and later too you could find some collocations that were really good, sometimes better than the higher ranked ones.I found this to be true for both ll3.pm and user3.pm RANK COMPARISON --------------- I used rank.pl to compare ll3.pm and user3.pm.The rank correlation coefficient I obtained was ll3.out user3.out -0.2018 This shows that the two tests have a completely reversed ranking. The two tests differ in that they appear to capture different elements and rank them accordingly.Log likelihood is a very popular test of measure when compared to "Goodman and Kruskal's Gamma".But in many ways I felt that the performance of user3 was much better. OVERALL RECOMMENDATION: ----------------------- Though I have not found proof to support the test of association in user3.pm I would think that it is working fairly well.I found some trigams that formed very good collocations.Also I feel that if there was a mechanism to eliminate the punctuation marks from occuring in the trigrams a better analysis of the experiment could have been carried out. CONCLUSION ---------- I did notice one thing in the experiments.A strong association between the words does not necessarily mean they form a collocation.They could have a high ranking because they probably appeared a number of times in the text and hence were put together,but they would definitely not form a collocation. This experiment also helped me to perform a lot of research in trying to find a test of association.I also spent a considerable time thinking about the trigrams and bigrams and trying to see if they formed meaningful collocations. ***The stop option was provided by creating a file STOP.txt This File had ; , etc each on a different line Then the --stop option was used on the command line when executing the program++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Paul Gordon CS8761 1913768 10/10/02 Assignment 2 -- The great Collocation Hunt Introduction: The great collocation hunt is an experiment in using statistical measures of association to rank and categorize collocations. The intention is to rank the most "interesting" collocations first, followed by more mundane, or less interesting collocations. In this experiment, four tests of association were used with bigrams, and the results compared with each other. Two of these tests, are part of the nsp package, mi.pm, which measures pointwise mutual information, and ll.pm, which measures the log-likelihood. Two of the measures were implemented by the experimenter, tmi.pm, which measures true mutual information, and user2.pm, which measures the odds ratio. Also, two tests of association were used with trigrams, ll3.pm, which measures log-likelihood, and user3.pm, which measures the odds ratio for trigrams. The Corpus used for the experiments was the combination of several Project Gutenberg files of the following texts: Crime and Punishment, Of Human Bondage, Moby Dick, and The complete King James Bible. BIGRAMS: user2.pm, tmi.pm, mi.pm and ll.pm TOP 50 COMPARISON The original implementation of True Mutual Information (tmi.pm) had only 49 ranks for a corpus of nearly 2 million tokens. This was a problem. Apparently TMI can become quite small with large corpora. As a result, I decided to multiply my result by a constant factor. This wouldn't affect the ordering, other than to make it a finer grained measure. The alternative, modifying the number of significant figures reported by statistic.pl was not an option. User2.pm, which measures the Odds Ratio, did well at finding proper nouns. It is not until number 21 with "categorical imperative", that the collocation is something other than a proper noun ( although "PUBLIC DOMAIN" is not really a proper noun ). The top ten result follow: Pulcheria<>Alexandrovna<>1 894532483.0000 123 123 123 Avdotya<>Romanovna<>2 836590755.0000 115 115 115 Moby<>Dick<>3 633790675.0000 87 87 87 Svidriga<>lov<>4 214698175.0000 207 210 207 Marfa<>Petrovna<>5 189534429.6667 78 78 79 Nikodim<>Fomitch<>6 177467563.0000 24 24 24 PROJECT<>GUTENBERG<>7 119519499.0000 16 16 16 El<>Greco<>8 97788843.0000 13 13 13 Praskovya<>Pavlovna<>9 68814523.0000 9 9 9 Sag<>Harbor<>10 61570923.0000 8 8 8 This particular measure scores high when the bigram parts are only associated with that bigram -- notice for the top three, the first and last words never occur as first and last words in any other bigrams. Just beyond "categorical imperative there are several interesting collocation, including "hoky poky", "Pell mell" and "web browser". In contrast, there are virtually no proper nouns in the top 50 of the true mutual information measure, Lord and Israel being the exceptions. The top ten results from the mutual information measure are shown below. It appears that in order to rank high, not only does the bigram have to appear often, but each individual word has to appear often as well. This makes sense considering that tmi is summed over all squares of the contingency table. Notice that the tmi is much larger than it should be. This is because of the multiplying factor mentioned above. To get the actual value, divide by 1000000000. ,<>and<>1 43006180.2526 32702 119732 59360 of<>the<>2 13431457.2323 15222 50674 94305 the<>LORD<>3 12450162.9418 5962 94305 6648 .<>He<>4 9075845.1419 4181 65625 5246 in<>the<>5 7285833.3391 7752 23359 94305 shall<>be<>6 6198847.1158 2526 10154 10124 ;<>and<>7 5242192.1121 4755 17562 59360 don<>t<>8 5106985.6083 972 973 2957 I<>will<>9 4941218.7702 2055 18788 4891 I<>am<>10 4360132.2218 1323 18788 1537 CUTOFF POINT There does not appear to be any good cutoff point for the true mutual information measure. But then, that is because so many of the top ranking collocations don't appear to be very interesting, just common. There are many connecting words in the collection. And there isn't much to distinguish the underlying corpus. There appears to be a cutoff area, rather than a cutoff point for the odds ratio. All bigrams whose constituents are only a component of that bigram, and which only appears once ( 1, 1, 1 ) are ranked 23. Between there and rank 40, the bigrams appear to go from interesting to unusual. At rank 40, one of the bigrams is "beneficent publicity". This strikes me more as an unusual phrase rather than a collocation. There are many of these by rank 40. This measure picks up bigrams whose components appear together a lot relative to the amount they appear within other bigrams. So at first the measure picks of interesting pairs, but further down the rank, they lose their interesting qualities. RANK COMPARISON rank ll to user2 : 0.6858 rank ll to tmi : 0.9984 rank mi to user2 : 0.9949 rank mi to tmi : 0.7049 True mutual information is more like the log-likelihood measure and user2 is more like pointwise mutual information. log-likelihood and true mutual information probably have a high correlation because they have a similar method of summing logs over each square of the contingency table. It is less clear to me why there is such a high correlation between the pointwise mutual information, and the odds ratio. I cannot see a relationship between the two fractions used in these measures. What is more, the pointwise mutual information calculation take the log of the fraction. In general, though, neither measure sums over each square of the contingency table, and so should be more like each other than they are like the other two measures. OVERALL RECOMMENDATION Based on the above experiments, it appears that the odds ratio does a much better job of giving interesting collocations a high rank. However, it also seems clear that the odds ratio misses interesting collocations with common words. As the ranking results suggest, the log-likelihood measure is very similar to that of true mutual information. In fact, many of the same bigrams appear in the top 50 of each. As such, though, it suffers a similar problem, and that is that the "interesting" collocations are buried under the common ones. Pointwise mutual information is not as closely similar to the odds ratio, as tmi is to the log-likelihood. It is, rather, similar in style. It also seems to rank highly, bigrams whose components don't make up other bigrams, but it is not as extreme as the odds ratio. After examining the ranking each of these measures gave the bigrams in the corpus, I consider the pointwise mutual information to have the best list of interesting bigrams. TRIGRAMS: user3.pm and ll3.pm TOP 50 COMPARISON As with the bigram form of the odds ratio, the trigram model picks up a lot of names. Now, however, it is attaching whatever the preceding word is with it. So nearly the whole top fifty words are one of two names preceded with another word. Obviously this is not very useful. Regarding the trigram log-likelihood, it is also much as it was in the bigram model. It ranks highly very common token combinations like "and the ,". Although the trigrams definitely give more of a sense of the underlying text, especially from the bibles contribution. It turns out that the bible was an unfortunate choice for inclusion, because of the chapter-verse markers which were noticeable with the bigrams, but are now very pronounced. CUTOFF POINT Again there did not appear to be any clear cutoff for the log-likelihood measure. Indeed, it seemed as though the quality of the bigrams improved with depth. But even at ranks above 4,000, they don't appear to me to be very interesting. The Odds ratio statistic ranking has a somewhat strange cutoff. After about rank 500, the names start to drop off, and interesting collocations begin to appear. I suppose this is a result of the this measures affinity to rank highly, proper nouns. OVERALL RECOMMENDATION Based on the results of both trigram statistics, my overall recommendation is for the odds ratio measure. It, in my estimation, has many more interesting trigrams. It would be especially good if there was a way to filter out the repetition of proper nouns. To me, the log likelihood ranks seem just as uninteresting at the beginning, as further down. Actually, somewhere in the middle, there appear to be what I would call interesting bigrams. For example the following is take from around the 400,000 rank is<>life<>;<>404855 189.2558 1 10385 1067 17562 11 211 54 these<>cases<>,<>404856 189.2550 3 1602 64 119732 4 229 23 were<>consumed<>with<>404857 189.2540 1 5305 108 12113 7 118 11 the<>letter<>suddenly<>404858 189.2532 1 94305 228 446 70 3 1 eyes<>were<>somehow<>404859 189.2528 1 1203 5305 83 49 1 4 while<>I<>live<>404860 189.2527 1 676 18788 441 37 2 43 near<>here<>,<>404861 189.2510 2 350 883 119732 3 30 177 be<>set<>for<>404862 189.2469 2 10124 912 12320 23 193 8 justice<>take<>hold<>404863 189.245 "while I live" and "justice take hold" seem to me to be good candidates for "interesting" trigrams. CONCLUDING THOUGHTS Conducting the experiments for this assignment has gotten me to think more deeply about what a makes a bigram a collocation. In one sense, a collocation is a bigram where the likelihood of the two words occurring together is less than what is actually observed. That is, there is some statistical significance. But part of this assignment was to assess how well the measures did, which is a subject criteria. In the Manning and Schutze chapter on collocations, they were interested in fairly common words such as "strong tea" and "powerful computers." Yet the results I got from the statistical tests seemed to rank high either very general word combinations like "is a" or fairly uncommon pairs like "El Greco." ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Name: Prashant Jain CS8761 Assignment #2:The Great Collocation Hunt Introduction ------------- This assignment involved finding a suitable measure of association that can identify collocations-which are defined in the textbook as "an expression consisting of two or more words that correspond to some conventional way of saying things."- in a large corpora of text. I had to implement a measure that can be used with bigrams and trigrams and then compare that measure with other measures that have been implemented in the N-gram Statistics Package. There are many different ways of measuring association amongst collocations(of different lengths, say 2 or 3).There are several methods that have been implemented in the N-grams Statistics Package already. These are, the Pearson's Chi Squared Test, Log Likelihood Ratio, Dice Coefficient, Pointwise Mutual Information and the Left Fisher's test. I had to implement 'True' Mutual information test for collocations of length 2 and find and implement another suitable test to identify collocations(which had to be extended and implemented for collocations of length 3). I also had to implement log likelihood ratio for collocations of length 3.I then had to compare the results from the various measures I have implemented. Measure Used and Implemented ---------------------------- The measure that I found and used in my experiments is a variation of the cross-product ratio called the Yule's Coefficient of Colligation. In the cross product ratio, we get the measure of association 'alpha' which is the ratio of cross products in the first row to that in the second and is calculated thus: alpha = (xy * xbarybar) / (xybar * ybarx) When there is no association present, alpha = 1 otherwise alpha has a value either greater to 1 or less than 1, depending on the degree of association. We can also generalize the cross-product ratio to more than 2 dimensions(useful to us for extending the test to trigrams).For three dimensions, the cross product ratio is calculated like this: alpha = (xyz * xybarzbar * xbaryzbar * xbarybarz) / (xyzbar * xybarz * xbaryz * xbarybarzbar) I didnt however use this value of alpha because of the unbounded nature of alpha. Hence, I used a transformation of alpha which restricts alpha's range to a finite interval. Coefficient of Colligation bounds the value of alpha between -1 and 1. It is calculated thus: Y = (sqrt(alpha) - 1) / (sqrt(alpha) + 1) Text Used --------- I used a corpus of over 1 million words taken from the Brown Corpus that had been made available to us at /home/cs/tpederse/CS8761/BROWN1. I created a file called CORPUS (available at /home/cs/jain0069/CS8761/nsp), which I created by concatenating the data present in all the files in the BROWN1 directory. Experiment 1 ------------ I had to implement 'True' mutual information for collocations of length 2. I used the following formula to implement true mutual information: I(X;Y) = summation for all values of X and Y(Joint Frequency(x,y) * log(Joint Frequency(x,y)/(Frequency(x)*Frequency(y)))) I created and implemented this in a file called tmi.pm . I then ran the program count.pl of the N-gram Statistics Package to extract the bigrams from the corpus.After that I ran the statistic.pl program on the output file produced by count.pl using the tmi.pm file. The results were stored in an output file called outtmi.txt. At the Command Line: ./count.pl testoutput.txt CORPUS ./statistic.pl tmi.pm outtmi.txt testoutput.txt I then implemented the 2 length collocation version of Coefficient of Colligation in a file called user2.pm .I repeated the procedure followed earlier and got the final result in an output file called outuser2.txt . At the Command Line: ./statistic.pl user2.pm outuser2.txt testoutput.txt Comparison of the Top 50 Ranks between tmi.pm and user2.pm ----------------------------------------------------------- The total number of bigrams that were extracted from the CORPUS were 1,336,151. I give the snapshot of the top ten values extracted by both tmi.pm and user2.pm in the table below. TMI Table --------- Collocation Extracted Rank TMI Value XY Value X Value Y Value . The 1 0.0135 4961 49673 6921 of the 2 0.0102 9181 36043 62690 . He 3 0.0071 2421 49673 2991 in the 4 0.0062 5330 19581 62690 . It 5 0.0049 1718 49673 2184 , and 5 0.0049 5530 58974 27952 . In 6 0.0037 1320 49673 1736 , but 7 0.0034 1648 58974 3006 . But 8 0.0033 1115 49673 1371 the the 8 0.0033 3 62690 62690 USER2 Table ----------- Collocation Extracted Rank Coeff. of Coll. Value XY Value X Value Y Value Export Import 1 0.9995 14 15 15 Yugoslav Claims 2 0.9993 7 8 8 PROVIDENCE PLANTATIONS 2 0.9993 6 7 7 Push Pull 3 0.9992 9 11 10 Dolce Vita 4 0.9991 4 5 5 Tar Heel 4 0.9991 8 10 9 Taft Hartley 5 0.9990 3 4 4 Baton Rouge 5 0.9990 3 4 4 Lo Shu 5 0.9990 17 20 19 Grands Crus 6 0.9988 2 3 3 I notice that amongst the top 50 ranked collocations found by tmi.pm and user2.pm, the ones that have been found by user2.pm seem to be much more interesting that the one that has been found by tmi.pm . As can be seen by the cross section of the top ten values given in the user2 table, the collocations that have been ranked near the top, seem to be occurring 'together' more frequently 'whenever' they are appearing in the corpus. For example, the top ranked collocation for USER2 is 'Export Import'. The words Export and Import occur together 14 times in the corpus and only once do they appear independently. As we look down the table, indeed the file, we see a similar pattern. The collocations ranked occur together in a higher proportion to their total occurance in the entire corpus.The collocations(if they can be called collocations!?) extracted by the TMI test appear to be mostly non-interesting. I think the reason behind this is that TMI goes for finding higher frequency words of any kinds which means that even uninteresting words like 'the', 'in', 'but' etc. are also incorporated in its pool of collocations. Cut-Off Point ------------- I haven't been able to notice any particular cut-off point in the output of tmi.pm . However, in the output of user2.pm I noticed something to the effect of a cut-off point. If we look at the collocations up to around rank 60, we notice that most of them are interesting. But as we go down the file, the number of interesting collocations keep on reducing. They reduce pretty significantly after around that rank. Hence we can say that the rank 60 is the cut-off point as far as user2.pm is concerned. Note: I need to mention here that more than one collocation has the same rank so if I say up to rank 60, it doesn't mean, just 60 collocations. Rank Comparison --------------- I first compared the output of log-likelihood ratio for two word collocations with the result of 'true' mutual information. On running rank.pl: Rank Correlation Coefficient = -0.9667 This signifies that the two tests are absolutely not similar to each other in the output they produce.I then compared the output of Point-Wise Mutual information for two word collocations with the results of 'true' mutual information. On running rank.pl: Rank Correlation Coefficient = -0.9667 Again we see, that the Rank Correlation Coefficient is very near to -1 which signifies that the test are absolutely dissimilar to each other in the collocations they produce according to the ranks. Now running the rank.pl to compare user2.pm with log-likelihood ratio, we get: Rank Correlation Coefficient = 0.6180 This result is significant as it tells us that my test, implemented in user2.pm is much closer in its output to the log-likelihood ratio output that the true mutual information test. Now we compare user2.pm with Point-Wise Mutual information. On running rank.pl, we get: Rank Correlation Coefficient = 0.7605 Here we notice that the Rank Correlation Coefficient is even closer to 1. Hence we can conclude that the output's of Point Wise Mutual Information and user2 are even more close to each other than those of log likelihood and user2. Hence we can conclude that while TMI is not like any of the other two, Coefficient of Colligation gives pretty similar output to log-likelihood test and even more similar output to PMI. Overall Recommendation ---------------------- Choosing between TMI and Coefficient of Colligation(CoC), we can easily conclude that it would be better to use CoC, since we have seen better results given by CoC almost everywhere. We saw the rank comparison and the collocations given by both the tests. Choosing between CoC, Log Likelihood and MI would however be a difficult job. We saw the first ten values in TMI and CoC output a little bit earlier. Now lets look at the first ten values in the output of Log Likelihood and MI and may be that would help us decide, which indeed is a better measure at finding collocations. The snapshot table of Log Likelihood is as follows: Log Likelihood -------------- Collocation Extracted Rank Log L.Value XY Value X Value Y Value . The 1 25043.4196 4961 49673 6921 of the 2 18856.1038 9181 36043 62690 . He 3 13183.4486 2421 49673 2991 in the 4 11393.6840 5330 19581 62690 . It 5 9139.3156 1718 49673 2184 , and 6 9072.9347 5530 58974 27952 . In 7 6844.0270 1320 49673 1736 , but 8 6309.5959 1648 58974 3006 the the 9 6126.7892 3 62690 62690 . But 10 6064.5057 1115 49673 1371 Point-Wise Mutual Information ----------------------------- Collocation Extracted Rank MI Value XY Value X Value Y Value Burly leathered 1 20.3497 1 1 1 Capetown coloreds 1 20.3497 1 1 1 Bietnar Haaek 1 20.3497 1 1 1 aber keine 1 20.3497 1 1 1 Despina Messinesi 1 20.3497 1 1 1 Thoroughbred Racing 1 20.3497 1 1 1 Josephus Danielswas 1 20.3497 1 1 1 POLAND _FRONTIERS 1 20.3497 1 1 1 cha chas 1 20.3497 1 1 1 Recherche Scientifique 1 20.3497 1 1 1 As we can notice in the tables, the log likelihood is giving disappointing values for its top ten ranked collocations. As in the True Mutual Information, we dont see anything that might be classified as interesting. Hence we cant claim it to be performing better than user2. However, when we look at the data for Point Wise Mutual Information, we see that there are some better results. If we keep looking down the table, we see collocations like TRAVEL CLUB 6 18.0277 1 1 5 These are interesting collocations that occur very frequently using this test. But, the one problem noticed by me in this test is that most of the collocations searched by this test a cluster of the collocations have the same MI value and the same rank. Hence if we are going to look at only the rank to determine which are the interesting collocations we would definitely have a difficult time because there are just too many collocations with the same rank and to bring out a few interesting collocations is going to be almost impossible. Hence, I would say that using user2 to calculate 2 word collocations would be the best thing to do in this case, according to the test results. Experiment 2 ------------ I had to implement the log likelihood ratio for collocations of length 3 in this experiment. The formula that I use for that is: G(X,Y,Z) = summation for all values of X,Y and Z(Frequency(x,y,z) * log(Frequency(x,y,z)/Expected Value)) The expected value is calculated thus: E(X,Y,Z) = (Frequency(X) * Frequency(Y) * Frequency(Z)) / (Total Trigrams * Total Trigrams) I created and implemented this in a file called ll3.pm . I then ran the program count.pl of the N-gram Statistics Package to extract the trigrams from the corpus.After that I ran the statistic.pl program on the output file produced by count.pl using the ll3.pm file. The results were stored in an output file called outll3.txt. At the Command Line: ./count.pl outputtri.txt CORPUS ./statistic.pl ll3.pm outll3.txt outtri.txt I then implemented the 3 length collocation version of Coefficient of Colligation in a file called user3.pm .I repeated the procedure followed earlier and got the final result in an output file called outuser3.txt . At the Command Line: ./statistic.pl user3.pm outuser3.txt outtri.txt Comparisons of Top 50 Ranks between ll3.pm and user3.pm ------------------------------------------------------- The total number of trigrams extracted from the corpus were 1,336,150.Again, I give snapshots of the top 10 values that I got from the output of both ll3.pm and user3.pm. These are: Log Likelihood Ratio (for Collocations of length 3) --------------------------------------------------- Collocation Extracted Rank TMI Value XYZ Value X Value Y Value Z Value XY Value XZ Value YZ Value N01 0680 chair 1 89726.2127 1 166 500 66 1 1 1 A20 0710 Britain 2 89663.1952 1 187 497 61 1 2 1 a belligerent motion 3 89653.3366 1 21998 5 59 1 3 1 J03 0151 be 4 87199.4986 1 176 2 6340 1 4 1 worn with crimson 5 87197.4879 1 26 7007 8 1 1 1 new culture ; 6 87184.0217 1 1045 58 2753 1 6 1 ,Selena said 7 86914.4145 1 58974 4 1950 1 614 1 conditions prevail ; 8 85214.6164 1 178 7 2753 1 1 1 enormous installationsat9 85214.4259 1 37 16 4966 1 1 1 source F17 1680 10 85211.4453 1 90 177 458 1 1 1 USER 3 TABLE ------------ Collocation Extracted Rank TMI Value XYZ Value X Value Y Value Z Value XY Value XZ Value YZ Value 9 a the 1 0.9937 1 95 21998 62690 2 2 2 parts a and 2 0.9847 2 109 21998 27952 3 3 8 thyroid . . 3 0.9845 1 38 49673 49674 3 3 2 of will she 4 0.9842 1 36043 2201 2087 2 9 2 as , One 5 0.9834 1 6691 58974 417 10 2 2 be you also 6 0.9833 1 6340 2979 997 2 2 2 to and including 7 0.9828 2 25725 27952 170 16 3 3 for or against 8 0.9813 2 8833 4127 618 3 3 5 on or before 9 0.9810 7 6405 4127 947 11 10 8 may or may 10 0.9802 10 1286 4127 1286 11 11 15 The two sets of data, when examined, do not reveal too much of dissimilarity, in extracting collocations of length 3. Atleast on the face value it looks like that.We don't see too many interesting collocations in the top ten values, as shown in the table. However, there are other values that can be considered of interest scattered over the top few ranks. For example in outuser3.txt we have the following at rank 14: As of now 14 0.9775 1 540 36043 1037 3 2 2 This seems to be an interesting collocation, present a bit higher up in the file. But as we go down, we see that the quality of the collocations extracted keeps on deteriorating till we find that no interesting collocations are being extracted by the user3.pm test. An interesting thing that i noticed in the log likelihood ratio test results were the unusually high frequency of collocations, either begining with a period(.) or having a period somewhere in between. For example: . The oilheating 64 78973.6002 1 49673 6921 2 4961 1 1 . The raped 65 78567.5017 1 49673 6921 2 4961 1 1 apple . The 66 78516.2169 1 8 49673 6921 1 1 4961 . The lordly 67 78508.7386 1 49673 6921 2 4961 1 1 As we can see, four consecutive ranks have the period in it. Infact, the next 10 ranks were the same. Cut Off Point ------------- As I said, the collocations that are extracted by user3.pm are not very interesting. However, do they show any trends or cut off points? It does seem so. Again, if we got beyond the ranks 55-60, we see a decided decrease in the interesting nature of the collocation. Hence we can say that the cut off point for user 3 is around the rank 55. We do notice a certain trend in the log likelihood data as well. We see that beyond the rank 35, the collocations mostly have uninteresting charecteristics. There are obviously a few collocations that exist and are interesting, but they are lost in a maze of too many uninteresting collocations. Overall Recommendation ----------------------- Here we have to choose only between the two tests, log likelihood and the Coefficient of Colligation test. Though the CoC test performs much better at two word collocations, when it comes to three word collocations, each of them produce equally uninteresting results. Especially, log likelihood with its excessive periods, doesnt leave any room for getting some interesting collocations. So I would say that neither of the tests can be really considered as being really better than each other. Each of them perform below par for the given set of data. Conclusion ----------- To conclude I would like to say that the measure that I found out and used to find interesting collocations in large corpora of text, gave mixed results. Giving very good results at finding 2 word collocations but giving below average results when searching for interesting 3 word collocations. We did however learn that even the log likelihood ratio for 3 word collocations doesnt actually give too good a results. So we can conclude by saying that we still need to find a decent method for extracting 3 word collocations from large corpora of text because the ones that were analyzed here do not seem to be doing there job up to par. References ----------- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schutze. MIT Press. Multiway Contingency Table Analysis for Social Sciences by Thomas D. Wickens http://www2.chass.ncsu.edu/garson/pa765/assoc2x2.htm http://www.spss.com/tech/stat/algorithms/11.0/cluster.pdf ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Name : Rashmi Vinay Kankaria Assignement no 2 Objectives : To find a measure of association and implement the same for bigrams and trigrams.Also find tmi for given bigrams and log likelihood ratio foe trigrams. Experiment 1 : First Module : tmi.pm This module is implemented to find true mutual information of given bigrams true mutual information can be formulated as , I(X:Y) = sum (over x, y) [ P(X,Y) * log ( P(X,Y)/ P(X)P(Y) ) ] 1629547 .<>The<>1 19721.9494 5593 65255 7800 of<>the<>2 16166.1332 10710 42133 74521 ,<>and<>3 12077.0140 8584 77904 34436 .<>He<>4 11447.6870 2986 65255 3671 in<>the<>5 10115.0110 6403 23448 74521 .<>It<>6 9809.9502 2555 65255 3130 ,<>but<>7 6486.1348 2362 77904 4111 .<>I<>8 5816.9501 3392 65255 12556 .<>But<>9 5762.9153 1506 65255 1846 to<>be<>10 5444.1418 2067 32004 7847 .<>In<>11 5293.2955 1458 65255 1918 ,<>,<>12 5186.6391 61 77904 77904 the<>the<>13 5118.9997 3 74521 74521 had<>been<>14 4815.7797 956 6773 3290 on<>the<>15 4557.6574 2441 7241 74521 have<>been<>16 4459.8949 875 5882 3290 This is the output file of the tmi.pm.The bigram ". The" has occured maximum number of times and so has rank one. the mutual information is given as 19721.9494. Of 1629547 number of bigrams, 5593 are ", the" well this bigram has not occured maximum number of times( as you one can seee the frequency of "of the" 10710 however still this has rank one as the mutual information is mroe in this . this means that these 2 words "." and " The" tend to occur more together than another words occuring together individually or otherwise. ll and tmi : 0.9932 Second Module : ll3.pm This module is implemented to perform the log likelihood ratio test for trigrams with the assumption of Null Hypothesis P(x,y,z) = p(x)p(y)p(z) The ratio is given by the following formula : G2 = 2 summation (over x,y) [ (oij) * log ( oij / eij )] where i < x,j < y oij = obsereved value (in ith row and jth column) eij = expected value (in ith row and jth column) Log likelihood for trigrams can be intuitively visualised as a cuboid with 3 co-ordinate axes x,y,z . thus the expexted frequency of any sample will be dependant not just on its total occurance in x and y place but also in z plane. ie "in this case", the frequency of of is not jst calculated wrt to this but also with respect to "case" and " this case". Third Module and Fourth Module : user2.pm and user3.pm This module implements a measure of association called Log Odds Ratio for bigrams. for 2 * 2 matrix, y y' x a b x' c d the value of a/b gives odds y v/s y' of y for x while value c/d gives odds y v/s y' of y for x' now the ratio of these 2 odds will give us how much the odds of x increases for y given the value of y for x' ( thus y ' making a reference point : intuitively) In simple words, suppose that A can be 70 % yes and 30% no then the chances of a beinf yes is 70/30 times that of no. also suppose B can be 60% yes and 40 % no then the odds here is 60/40 which menas that this many times B can be yes given the chance of no to be 40%. now the odds ratio for A and B will be 70/30 / 60/40 = f which means that "A will be "yes" x times B given their individual chances of no .. Considering the chances of no is important because "yes" and "no" together make the sample space for each of them. thus this reasoning can be extended to 3 variales also as the ratio of odds taken under same circumstances either it is 2 variables or three variables. Thus we can conclude that it is a relative measure and thus can be easily accomodated as a measure of association. generally working with logarithmic values is more managable so we have log odd's ratio where value can also go negative.howver higher the value indiacates that they asre related while as it goes on decreading they are less dependant( their occurance together is less likely) here is sample output of user2.pm... 1629547 ,<>000<>1 29.3663 446 77904 446 Sherlock<>Holmes<>2 28.2934 202 202 963 couldn<>t<>3 28.0851 175 175 2448 wasn<>t<>4 27.8913 153 153 2448 wouldn<>t<>5 27.7528 139 139 2448 Hor<>.<>6 27.3715 111 111 65268 hadn<>t<>7 27.3343 104 104 2448 isn<>t<>8 27.2037 95 95 2448 Laer<>.<>9 26.5313 62 62 65268 Project<>Gutenberg<>10 26.4689 57 86 57 Oph<>.<>11 26.4350 58 58 65268 Clown<>.<>12 26.1006 46 46 65268 ain<>t<>13 26.0262 42 42 2448 Ay<>,<>14 25.7353 36 36 77904 O<>Banion<>15 25.7233 34 261 34 Guil<>.<>16 25.7063 35 35 65268 Bang<>Jensen<>17 25.5902 31 31 35 aren<>t<>18 25.5881 31 31 2448 Osr<>.<>19 25.2209 25 25 65268 shouldn<>t<>20 25.2188 24 24 2448 of<>he<>90077 -7.9584 1 42133 9317 .<>by<>90078 -7.9589 1 65255 5943 a<>of<>90079 -7.9596 3 27651 42133 .<>you<>90080 -7.9898 1 65255 6071 to<>was<>90081 -7.9949 1 32004 12634 .<>is<>90082 -8.0481 2 65255 12590 .<>the<>90083 -8.0870 12 65255 74521 of<>for<>90084 -8.1316 1 42133 10498 a<>to<>90085 -8.1386 2 27651 32004 .<>had<>90086 -8.1483 1 65255 6773 The<>.<>90087 -8.3537 1 7802 65268 the<>s<>90088 -8.3777 1 74521 6911 .<>with<>90089 -8.5395 1 65255 8870 .<>his<>90090 -8.5531 1 65255 8954 .<>a<>90091 -8.6123 3 65255 27651 The<>,<>90092 -8.6207 1 7802 77904 and<>.<>90093 -8.9356 3 34436 65268 the<>is<>90094 -9.2484 1 74521 12590 to<>to<>90095 -9.3536 1 32004 32004 a<>the<>90096 -9.3976 2 27651 74521 the<>the<>90097 -10.2880 3 74521 74521 the<>a<>90098 -10.3977 1 74521 27651 As you see the ", space" has got rank one which means that they are highly dependant. as u see more .. t preceeded by shoudn and aren does show that t will occu with this. the last part of the same output show that the data like "the a" and "of he" etc are not at all dependant and can conclude that negative values indicate independance between the words in the bigrams.thus here is a balance point between the independant and dependant variable values . TOP 50 COMPARISONS between user2.pm and tmi.pm user.pm : 1629547 ,<>000<>1 29.3663 446 77904 446 Sherlock<>Holmes<>2 28.2934 202 202 963 couldn<>t<>3 28.0851 175 175 2448 wasn<>t<>4 27.8913 153 153 2448 wouldn<>t<>5 27.7528 139 139 2448 Hor<>.<>6 27.3715 111 111 65268 hadn<>t<>7 27.3343 104 104 2448 isn<>t<>8 27.2037 95 95 2448 Laer<>.<>9 26.5313 62 62 65268 Project<>Gutenberg<>10 26.4689 57 86 57 Oph<>.<>11 26.4350 58 58 65268 Clown<>.<>12 26.1006 46 46 65268 ain<>t<>13 26.0262 42 42 2448 Ay<>,<>14 25.7353 36 36 77904 tmi.pm : 1629547 .<>The<>1 19721.9494 5593 65255 7800 of<>the<>2 16166.1332 10710 42133 74521 ,<>and<>3 12077.0140 8584 77904 34436 .<>He<>4 11447.6870 2986 65255 3671 in<>the<>5 10115.0110 6403 23448 74521 .<>It<>6 9809.9502 2555 65255 3130 ,<>but<>7 6486.1348 2362 77904 4111 .<>I<>8 5816.9501 3392 65255 12556 .<>But<>9 5762.9153 1506 65255 1846 to<>be<>10 5444.1418 2067 32004 7847 .<>In<>11 5293.2955 1458 65255 1918 , in true mutual information among the 2 words while user2 gievs us an idea about the independance and dependance between the two words in the bigrams. if one ses the user2.pm the higher ranks just suggest that though these words have occured very less time individually ,in case of "wasn , t ", the first word occruing in that place given the chance that the second word occured in that place..eg occurance of wasn given that of t is high that means that it ranks the bigrams depending on the occurance of first word given the occurance of the bigram which can be interesting measure for "prediction" too. in case of true mutual information, the stress is more on the individual words occuring together given the chances of their own occurance and occurance together and thus it gives urs the complete information about two words,occuring together , not occuring together , one occuring but not other .So if you see the higher side eg,words like "the,".","he" which have highly occured in the text and thus having higher probabaility are making the higher ranks as compared to others whose rank can be lower because of their individual frequecies and joint frequencies.this can be a good measure to decide the probability of comman words in a texta nd their typical associations. eg association of " of" with " the" or " to" with " be". thus both the measure look differently at the text. CUTOFF POINT : in case user2, the cutoff will be anything below zero after which the bigrams wont make sense as they cant occur like that in text. RANK COMPARISON : rank values for following ... tmi and mi : 0.8086 user2 and mi : 0.9669 rank values : ll and user2 : 0.8080 mi and user2 : 0.9669 this shows that the user2 computes similar results. Overall Recommendation : The measure of associations used give different interpretation of the given corpora and so they have their own stand. References : Books : Foundations of statistical natural language processing : Manning and Schutze Introduction to probability and mathematical statistics : Bain and Engelhardt Nonparametric Statistical Inference : J.Gibbons and S. Chakraborti Multiway Contingency Tables Analysis for the Social Sciences : Thomas Wickens. urls : http://home.clara.net/sisa/two2hlp.htm http://www.mathpsyc.uni-bonn.de/doc/cristant/node15.html http://www.ci.tuwien.ac.at/~zeileis/teaching/slides4.pdf ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ NAME: SUMALATHA KUTHADI CLASS: NATURAL LANGUAGE PROCESSING DATE: 10 / 11 / 02 CS8761 : ASSIGNMENT 2 -> TO INVESTIGATE VARIOUS MEASURES OF ASSOCIATION THAT CAN BE USED TO IDENTIFY COLLOCATIONS IN LARGE CORPORA OF TEXT. INTRODUCTION: -> COLLOCATIONS: A COLLOCATION IS AN EXPRESSION THAT CONSISTS OF TWO OR MORE WORDS THAT CORRESPOND TO SOME CONVENTIONAL WAY OF SAYING SOMETHING. -> TEST OF ASSOCIATION: IT IS A TEST WHICH IS CARRIED OUT TO FIND THE "ASSOCIATION AMONG THE WORDS" IN A COLLOCATION. -> "ASSOCIATION" AMONG THE WORDS IS ALSO CALLED AS "MUTUAL INFORMATION". -> "MUTUAL INFORMATION" MEASURES THE DEGREE OF DEPENDENCE BETWEEN THE WORDS IN A COLLOCATION. IT IS THE AMOUNT OF INFORMATION ONE RANDOM VARIABLE CONTAINS ABOUT ANOTHER RANDOM VARIABLE. -> WHEN TEST OF ASSOCIATION IS PERFORMED ON A BIGRAM, IT ASSIGNS A SCORE AND A RANK TO THE COLLOCATION. RANK GIVEN IS BASED ON THE SCORE. EXPERIMENTS DONE IN THE ASSIGNMENT: -> tmi.pm: THIS TEST FINDS THE TRUE MUTUAL INFORMATION AMONG THE WORDS IN THE COLLOCATION AND ASSIGNS SCORES TO THE THEM. RANKS ARE BASED ON THE SCORES GIVEN TO BIGRAMS. FORMULA USED: I(X:Y)= SUMMATION(P(X:Y) * log(P(X:Y)/P(X)*P(Y)) ); I- TRUE MUTUAL INFORMATION P(X:Y)-JOINT PROBABILITY OF THE WORDS IN BIGRAM P(X)- PROBABILITY OF FIRST WORD IN BIGRAM P(Y)- PROBABILITY OF SECOND WORD IN BIGRAM -> user2.pm: THE TEST USED IN THIS PROGRAM IS u-test. THIS PROGRAM FINDS SIGNIFICANT COLLOCATIONS. IT ALSO ASSIGNS RANKS TO THE COLLOCATIONS. THESE RANKS ARE BASED ON THE SCORES ASSIGNED TO THE COLLOCATIONS. SCORES AND RANKS ARE INVERSELY PROPORTIONAL TO EACH OTHER. FORMULAE USED: u = (jointprob - indprob) / sqrt(indprob * (1 - indprob) / npp); jointprob = p(w1,w2) = n(w1,w2) / npp; indprob = p(w1,w2) = (n(w1) * n(w2)) / (npp * npp); n(w1) -> number of times w1 occurs as 1st word in bigram. n(w2) -> number of times w2 occurs as 2nd word in bigram. n(w1,w2)-> number of times both w1 & w2 occurs as 1st & 2nd words respectively in the bigram. npp -> total number of bigrams REFERENCE: http://icl.pku.edu.cn/yujs/papers/pdf/h-test.pdf -> user3.pm : THIS PROGRAM IS AN EXTENSION OF user2.pm. THIS PROGRAM FINDS TRIGRAMS. ALSO, ASSIGNS SCORES AND RANKS TO TRIGRAMS. FORMULAE USED: u = (jointprob - indprob) / sqrt(indprob * (1 - indprob) / npp); jointprob = p(w1,w2,w3) = n(w1,w2,w3) / nppp; indprob = p(w1,w2,w3) = (n(w1) * n(w2) * n(w3)) / (nppp * nppp * nppp); n(w1) -> number of times w1 occurs as 1st word in trigram. n(w2) -> number of times w2 occurs as 2nd word in trigram. n(w3) -> number of times w3 occurs as 3rd word in trigram. n(w1,w2,w3)-> number of times all w1,w2 and w3 occur as 1st,2nd and 3rd words respectively in the trigram. nppp -> total number of trigrams. -> ll3.pm: "LOGLIKELIHOOD TEST" IS USED IN THIS PROGRAM. THIS PROGRAM FINDS SIGNIFICANT COLLOCATIONS . THIS PROGRAM ALSO ASSIGNS SCORES AND RANKS TO COLLOCATIONS. FORMULAE USED: g2 = 2*summation(nijk * log (nijk * eijk)); nijk ->frequency of trigram ijk. eijk ->estimation of trigram ijk. eijk = nipp * npjp * nppk / nppp. nppp -> total number of trigrams. EXPERIMENT 1: TOP 50 COMPARISON: ->FROM THE TOP 50 RANKS PRODUCED BY user2.pm AND tmi.pm, I OBSERVE THAT tmi.pm ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS, WHICH OCCUR FREQUENTLY. user2.pm ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS, IN WHICH 1ST WORD OCCURS ONLY WITH 2ND WORD OR A FEW OTHER WORDS AND VICE-VERSA. ->FOR EXAMPLE: TOP6 RANKS GIVEN BY user2.pm AND tmi.pm. ->OUTPUT OF user2.pm: BINI<>SALFININISTAS<>1 1155.9191 1 1 1 Multiphastic<>Personality<>1 1155.9191 1 1 1 Theodor<>Reik<>1 1155.9191 1 1 1 PRINCE<>WEARS<>1 1155.9191 1 1 1 Birdie<>Gevurtz<>1 1155.9191 1 1 1 Manon<>Lescaut<>1 1155.9191 1 1 1 -> OUTPUT OF tmi.pm: .<>The<>1 0.0135 4961 49673 6921 of<>the<>2 0.0102 9181 36043 62690 .<>He<>3 0.0071 2421 49673 2991 in<>the<>4 0.0062 5330 19581 62690 ,<>and<>5 0.0049 5530 58974 27952 .<>It<>5 0.0049 1718 49673 2184 ->I THINK user2.pm IS BETTER THAN tmi.pm AT IDENTIFYING SIGNIFICANT COLLOCATIONS BECAUSE user2.pm IS GOOD AT IDENTIFYING COLLOCATIONS WHICH HAVE SOME MEANING AND ASSIGNS THEM HIGH RANKS. tmi.pm CONSIDERS COLLOCATIONS, WHICH OCCUR FREQUENTLY AS BEST COLLOCATIONS. WORDS LIKE I, HE, HAD, THE, A ETC AND PUNCTUATIONS OCCUR FREQUENTLY IN ANY TEXT. THESE WORDS AND PUNCTUATIONS ARE ASSOCIATED WITH MANY OTHER WORDS. SO,COLLOCATIONS WHICH CONTAIN THESE WORDS AND PUNCTUATIONS ARE CONSIDERED AS BEST COLLOCATIONS BY tmi.pm. ->IN THE TEXT I USED "I HAD" OCCURED 810 TIMES AND "NOMINALLY ESTIMATED" OCCURED ONLY 1 TIME. ACCORDING TO user2.pm, "NOMINALLY ESTIMATED" IS THE BEST COLLOCATION WHERE AS ACCORDING TO tmi.pm, "I HAD" IS A BEST COLLOCATION. CUTOFF POINT: ->CUTOFF POINT FOR user2.pm: I THINK THE OUTPUT OF user.pm HAS A NATURAL CUT OFF POINT FOR SCORES AT 1155.9191. PART OF THE OUTPUT: IT CONTAINS SCORE CUTOFF. CONTINUUM<>MECHANICS<>1 1155.9191 1 1 1 Stabat<>Mater<>1 1155.9191 1 1 1 Chou<>En<>2 1155.9182 2 2 2 Nogay<>Tartary<>2 1155.9182 2 2 2 Pee<>Wee<>2 1155.9182 2 2 2 ->"I OBSERVED THAT ABOVE THE CUTOFF POINT, ALL THE COLLOCATIONS HAVE WORDS THAT DO NOT APPEAR IN ANY OTHER COLLOCATIONS." ->CUTOFF POINT OF tmi.pm: I THINK THE OUTPUT OF tmi.pm HAS A NATURAL CUT OFF POINT AT 0.0001 BECAUSE AFTER THAT POINT SCORE OF ALL COLLOCATIONS IS ZERO. FROM THE OUTPUT TABLE, I OBSERVED THAT ALL FREQUENCY COMBINATIONS OF ALL BIGRAMS ARE GREATER THAN ZERO. PART OF THE OUTPUT: IT CONTAINS SCORE CUTOFF. contact<>with<>39 0.0001 30 59 7007 returned<>to<>39 0.0001 52 115 25725 wavers<>and<>40 0.0000 1 2 27952 agglutination<>of<>40 0.0000 1 4 36043 effective<>they<>40 0.0000 2 127 2855 ->"I THINK 0.0001 IS THE CUTOFF POINT BECAUSE AFTER THAT POINT ALL COLLOCATIONS HAVE SAME RANK. MIGHT BE SCORE VALUE IS ROUNDED TO 4 POINT PRECESION, BUT THE ACTUAL SCORE OF ALL BIGRAMS MIGHT BE ALMOST EQUAL TO EACH OTHER. I THINK BELOW THE CUT OFF POINT ALL BIGRAMS ARE ALL EQUALLY SIGNIFICANT. RANK COMPARISION: *->COMPARING user2.pm AND tmi.pm WITH mi.pm: RANK CORRELATION COEFFICIENT OF user2.pm WITH mi.pm: 0.9523 RANK CORRELATION COEFFICIENT OF tmi.pm WITH mi.pm: -0.9528 ->BASED ON "RANK CORRELATION COEFFICIENT" AND "RANKINGS" GIVEN BY THE BOTH TESTS, I THINK THAT user2.pm IS MORE LIKE mi.pm. ->RANKS GIVEN BY tmi.pm AND mi.pm ARE TOTALLY DIFFERENT. *->COMPARING user2.pm AND tmi.pm WITH ll.pm: RANK CORRELATION COEFFICIENT OF user2.pm WITH ll.pm: 0.8870 RANK CORRELATION COEFFICIENT OF tmi.pm WITH ll.pm: -0.9527 ->BASED ON "RANK CORRELATION COEFFICIENT" I THINK THAT user2.pm IS MORE LIKE ll.pm. OVERALL RECOMMENDATION: -> I THINK user2.pm IS BETTER THAN ll.pm, tmi.pm AND mi.pm. tmi.pm AND ll.pm CONSIDER FREQUENTLY OCCURING COLLOCATIONS AS SIGNIFICANT COLLOCATIONS. SO THEY CONSIDER COLLOCATIONS WHICH HAVE PUNCTUATIONS AND WORDS LIKE THE, HE, IT ETC AS SIGNIFICANT COLLOCATIONS. -> BETWEEN user2.pm AND mi.pm, I THINK user.pm IS BETTER. BECAUSE, user.pm CONSIDERS COLLOCATIONS, WHICH CONTAINS WORDS THAT OCCUR ONLY TOGETHER AND NOT WITH ANY OTHER WORDS, AS SIGNIFICANT COLLOCATIONS. THESE SIGNIFICANT COLLOCATIONS WILL NOT HAVE PUNCTUATIONS AND WORDS LIKE THE, HE ,IT ETC. EXPERIMENT 2: TOP 50 RANK: ->RANK CORRELATION COEFFICIENT: -0.7953 ->FROM THE TOP 50 RANKS PRODUCED BY user3.pm AND ll3.pm, I OBSERVE THAT ll3.pm ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS WHICH OCCUR FREQUENTLY. user3.pm ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS IN WHICH 1ST WORD OCCURS ONLY WITH 2ND WORD OR A FEW OTHER WORDS AND VICE-VERSA. ->FOR EXAMPLE: TOP6 RANKS GIVEN BY user2.pm AND tmi.pm. ->OUTPUT OF ll3.pm: ,<>and<>,<>1 2923.7663 1 304 149 304 76 14 3 ,<>and<>the<>2 2400.4036 6 304 149 172 76 23 13 ,<>and<>I<>3 2164.3900 3 304 149 96 76 14 9 ,<>and<>my<>4 2096.5959 1 304 149 76 76 4 6 ,<>and<>was<>5 2031.9212 3 304 149 55 76 15 3 ,<>and<>that<>6 2024.7081 4 304 149 64 76 7 9 ->OUTPUT OF user3.pl true<>repenting<>prodigal<>1 3870.9997 1 1 1 1 1 1 1 wer<>n<>t<>1 3870.9997 1 1 1 1 1 1 1 Low<>Country<>wars<>1 3870.9997 1 1 1 1 1 1 1 secret<>burning<>lust<>1 3870.9997 1 1 1 1 1 1 1 eighteen<>years<>old<>1 3870.9997 1 1 1 1 1 1 1 battle<>near<>Dunkirk<>2 2737.2100 1 1 2 1 1 1 1 ->I THINK user3.pm IS BETTER THAN ll3.pm AT IDENTIFYING SIGNIFICANT COLLOCATIONS BECAUSE user2.pm IS GOOD AT IDENTIFYING COLLOCATIONS WHICH HAVE SOME MEANING AND ASSIGNS THEM HIGH RANKS. tmi.pm CONSIDERS COLLOCATIONS, WHICH OCCUR FREQUENTLY, AS BEST COLLOCATIONS. WORDS LIKE I, HE, HAD, THE, A ETC AND PUNCTUATIONS OCCUR FREQUENTLY IN ANY TEXT. THESE WORDS AND PUNCTUATIONS ARE ASSOCIATED WITH MANY OTHER WORDS. SO,COLLOCATIONS, WHICH CONTAIN THESE WORDS AND PUNCTUATIONS ARE CONSIDERED AS BEST COLLOCATIONS BY tmi.pm. ->IN THE TEXT I USED, "TRUE REPENTING PRODIGAL" OCCURED 1 TIME AND ", and ," OCCURED ONLY 76 TIMES. ACCORDING TO user3.pm, "TRUE REPENTING PRODIGAL" IS THE BEST COLLOCATION WHERE AS ACCORDING TO ll3.pm, ", and ," IS A BEST COLLOCATION. CUTOFF POINT: ->CUTOFF POINT FOR SCORES OF user3.pm: CUTOFF POINT: 3870.9997 ABOVE THIS POINT ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ONLY WITH THAT COLLOCATION. BELOW THIS POINT ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ALSO IN OTHER COLLOCATIONS. ->CUTOFF POINT FOR SCORES OF ll3.pm: CUTOFF POINT: -92244.4599 ABOVE THIS POINT, ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ONLY WITH THAT COLLOCATION. BELOW THIS POINT, ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ALSO IN OTHER COLLOCATIONS OVERALL RECOMMENDATION: I THINK user3.pm IS BETTER THAN ll3.pm.AS I SAID ABOVE, user3.pm CONSIDERS COLLOCATIONS, THAT HAVE SOME MEANING, AS SIGNIFICANT COLLOCATIONS. ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ # ********************************************************************************* # experiments.txt Report for Assignment #2 N-gram Statistics # Name: Yanhua Li # Class: CS 8761 # Assignment #2: Oct. 11, 2002 # ********************************************************************************* This experiment compares several measures of association in large corpora of text. I use the text from http://ibiblio.unc.edu/pub/docs/books/gutenberg/etext02/cwces10.txt Commands to use: First run count.pl cwces10.cnt cwces10.txt count.pl --ngram 3 cwces10_3.cnt cwces10.txt to get bigrams, trigrams and their frequencies. Then run statistic.pl mi cwces10.mi cwces10.cnt statistic.pl tmi cwces10.tmi cwces10.cnt statistic.pl user2 cwces10.user2 cwces10.cnt statistic.pl ll cwces10.ll cwces10.cnt statistic.pl --ngram 3 ll3 test.ll3 cwces10_3.cnt statistic.pl --ngram 3 cwces10.user3 cwces10_3.cnt %%%%%%%%%%%%%% EXPERIMENT 1 1.1 TOP 50 COMPARISON 1.1.1 user2.pm Using user2.pm (namely, phi coefficient method) to run NSP Part of results ********************************************* 136218 SEND<>MONEY<>1 1.0000 1 1 1 loftiest<>pulpit<>1 1.0000 1 1 1 revised<>rates<>1 1.0000 1 1 1 irreclaimable<>givers<>1 1.0000 1 1 1 Petite<>Provence<>1 1.0000 1 1 1 HIS<>WIFE<>1 1.0000 3 3 3 NEW<>FEMININE<>1 1.0000 3 3 3 Revenue<>Service<>1 1.0000 1 1 1 CLEARING<>HOUSE<>1 1.0000 3 3 3 knicker<>bockers<>1 1.0000 1 1 1 palatial<>apartment<>1 1.0000 1 1 1 PREVAILING<>DISCONTENT<>1 1.0000 2 2 2 25<>Short<>1 1.0000 1 1 1 Copper<>Cylinder<>1 1.0000 1 1 1 ablution<>bowls<>1 1.0000 1 1 1 COMMON<>SCHOOL<>1 1.0000 1 1 1 THOUGHTS<>SUGGESTED<>1 1.0000 1 1 1 Treasury<>sends<>1 1.0000 1 1 1 NATHAN<>HALE<>1 1.0000 3 3 3 railroad<>fare<>1 1.0000 1 1 1 boats<>glide<>1 1.0000 1 1 1 angular<>corners<>1 1.0000 1 1 1 Uncle<>Sam<>1 1.0000 2 2 2 DON<>T<>1 1.0000 1 1 1 ENGLISH<>VOLUNTEERS<>1 1.0000 3 3 3 SOME<>CAUSES<>1 1.0000 2 2 2 brevity<>recommends<>1 1.0000 1 1 1 promotes<>trading<>1 1.0000 1 1 1 300<>billion<>1 1.0000 1 1 1 Small<>Print<>1 1.0000 6 6 6 . . . (totally 330 bigrams rank No.1) o<>clock<>2 0.9354 7 8 7 LEISURE<>CLASS<>3 0.8660 3 3 4 WITH<>AN<>3 0.8660 3 4 3 MS<>38655<>3 0.8660 3 4 3 hart<>pobox<>3 0.8660 3 4 3 AN<>EGO<>3 0.8660 3 3 4 Long<>Island<>4 0.8165 2 3 2 THIS<>ETEXT<>4 0.8165 2 2 3 EVEN<>IF<>4 0.8165 2 2 3 Private<>Theatricals<>4 0.8165 2 3 2 GET<>GUTINDEX<>4 0.8165 2 2 3 Meredith<>understands<>4 0.8165 2 2 3 DOMAIN<>ETEXTS<>4 0.8165 2 2 3 BUT<>NOT<>4 0.8165 2 2 3 Project<>Gutenberg<>4 0.8165 18 27 18 seventh<>son<>4 0.8165 2 3 2 Lydia<>Blood<>4 0.8165 2 3 2 Sandwich<>Islands<>4 0.8165 2 2 3 United<>States<>5 0.8086 34 34 52 . . . to<>and<>1533 -0.0279 1 3528 3960 to<>.<>1534 -0.0286 2 3528 4219 .<>and<>1535 -0.0304 2 4219 3960 .<>.<>1536 -0.0307 5 4219 4219 to<>of<>1537 -0.0337 1 3528 5668 in<>,<>1538 -0.0349 9 2828 8193 of<>and<>1539 -0.0358 1 5668 3960 ,<>of<>1540 -0.0365 105 8193 5668 of<>.<>1541 -0.0366 3 5668 4219 .<>of<>1542 -0.0370 1 4219 5668 a<>,<>1543 -0.0376 1 2975 8193 and<>,<>1544 -0.0388 27 3960 8193 to<>,<>1545 -0.0401 6 3528 8193 .<>,<>1546 -0.0447 3 4219 8193 of<>,<>1547 -0.0505 14 5668 8193 the<>,<>1548 -0.0639 1 8202 8193 (the end) **************************************** Phi coefficient = 1 means this bigram is highly dependent. We can see the last three digits at each row of No.1 ranked bigram are the same. That means those two words in this bigram always appear together so we think they are dependent. Actually this measure is not very good since most of those ranked No.1 bigrams only appear once in this large text and some of them are words seldom used. So we cant simply say they are dependent. TOP 50 all mean two words of the bigram have some kind of high dependence. The bottom ranks mean those two words are independent. Phi coefficient method is good at this measure because those words appear together for very few times but appear individually a lot. 1.1.2 tmi.pm Using tmi.pm (namely, "true" mutual information method) to run NSP Part of results ************************************** 136218 ,<>and<>1 0.0302 1533 8193 3960 of<>the<>2 0.0207 1392 5668 8202 .<>The<>3 0.0202 573 4219 659 .<>It<>4 0.0170 472 4219 508 in<>the<>5 0.0127 783 2828 8202 it<>is<>6 0.0112 414 1667 2624 It<>is<>7 0.0103 288 508 2624 ,<>but<>8 0.0092 352 8193 501 to<>be<>9 0.0087 341 3528 1183 .<>We<>10 0.0081 229 4219 260 .<>But<>11 0.0072 198 4219 203 .<>And<>12 0.0060 169 4219 193 is<>not<>13 0.0056 224 2624 1076 .<>In<>14 0.0051 143 4219 163 would<>be<>15 0.0050 131 405 1183 to<>the<>15 0.0050 523 3528 8202 ,<>or<>16 0.0049 262 8193 739 .<>This<>17 0.0048 135 4219 150 .<>There<>18 0.0047 133 4219 149 may<>be<>19 0.0046 114 301 1183 has<>been<>20 0.0045 93 518 263 .<>If<>21 0.0044 125 4219 143 the<>world<>22 0.0043 169 8202 253 have<>been<>23 0.0042 94 697 263 there<>is<>23 0.0042 129 318 2624 does<>not<>24 0.0041 86 118 1076 .<>He<>25 0.0037 104 4219 112 do<>not<>26 0.0036 87 229 1076 is<>a<>27 0.0035 234 2624 2975 ;<>but<>28 0.0034 89 660 501 for<>the<>28 0.0034 231 939 8202 they<>are<>29 0.0033 90 515 766 we<>have<>30 0.0032 91 652 697 on<>the<>31 0.0031 172 515 8202 the<>most<>32 0.0030 126 8202 229 should<>be<>32 0.0030 75 198 1183 must<>be<>32 0.0030 70 148 1183 I<>am<>33 0.0029 46 389 46 in<>a<>33 0.0029 216 2828 2975 will<>be<>33 0.0029 82 337 1183 by<>the<>34 0.0028 184 736 8202 United<>States<>34 0.0028 34 34 52 . (omit 110 raws) . . New<>England<>50 0.0011 16 62 56 as<>well<>50 0.0011 30 976 114 ,<>with<>50 0.0011 107 8193 682 there<>are<>50 0.0011 34 318 766 no<>more<>50 0.0011 30 328 409 .<>Some<>50 0.0011 30 4219 32 it<>would<>50 0.0011 46 1667 405 it<>may<>50 0.0011 42 1667 301 ,<>he<>50 0.0011 98 8193 562 . . . it<>to<>63 -0.0002 18 1667 3528 that<>of<>64 -0.0003 41 1912 5668 ,<>is<>64 -0.0003 126 8193 2624 in<>,<>64 -0.0003 9 2828 8193 is<>.<>64 -0.0003 15 2624 4219 not<>,<>64 -0.0003 25 1076 8193 ,<>a<>64 -0.0003 148 8193 2975 of<>that<>64 -0.0003 36 5668 1912 and<>to<>64 -0.0003 67 3960 3528 and<>is<>64 -0.0003 16 3960 2624 be<>,<>64 -0.0003 20 1183 8193 ,<>to<>65 -0.0004 174 8193 3528 is<>of<>65 -0.0004 19 2624 5668 that<>,<>65 -0.0004 30 1912 8193 is<>,<>66 -0.0005 94 2624 8193 of<>,<>66 -0.0005 14 5668 8193 and<>of<>67 -0.0006 52 3960 5668 and<>,<>67 -0.0006 27 3960 8193 ,<>of<>68 -0.0013 105 8193 5668 (the end) ************************************** From the result, we can see tmi method is much better than phi coefficient method at identifying significant or interesting collocations. TOP 50 shows very high dependence and that is the case. It's independence analysis is not as good as phi coefficient method as we can see from last 3 digits of bottom row bigrams. Therefore tmi is better in determining dependence but phi coefficient is better in determining independence. 1.2 CUTOFF POINT 1.2.1 user2.pm Part of results ****************************************** 136218 SEND<>MONEY<>1 1.0000 1 1 1 loftiest<>pulpit<>1 1.0000 1 1 1 revised<>rates<>1 1.0000 1 1 1 irreclaimable<>givers<>1 1.0000 1 1 1 Petite<>Provence<>1 1.0000 1 1 1 HIS<>WIFE<>1 1.0000 3 3 3 NEW<>FEMININE<>1 1.0000 3 3 3 Revenue<>Service<>1 1.0000 1 1 1 CLEARING<>HOUSE<>1 1.0000 3 3 3 knicker<>bockers<>1 1.0000 1 1 1 . . . (possible cut off point) is<>because<>1354 0.0005 2 2624 92 public<>as<>1354 0.0005 1 115 976 ,<>determined<>1354 0.0005 1 8193 14 ,<>trained<>1354 0.0005 1 8193 14 and<>equal<>1354 0.0005 1 3960 29 book<>a<>1354 0.0005 1 38 2975 the<>large<>1354 0.0005 2 8202 29 world<>to<>1354 0.0005 7 253 3528 house<>a<>1354 0.0005 1 38 2975 none<>the<>1354 0.0005 1 14 8202 to<>foreign<>1354 0.0005 1 3528 32 ,<>current<>1354 0.0005 1 8193 14 case<>.<>1354 0.0005 1 27 4219 them<>an<>1354 0.0005 1 222 507 of<>brilliant<>1354 0.0005 1 5668 20 early<>,<>1354 0.0005 1 14 8193 Project<>.<>1354 0.0005 1 27 4219 that<>money<>1355 0.0004 1 1912 62 her<>an<>1355 0.0004 1 229 507 . . . .<>.<>1536 -0.0307 5 4219 4219 to<>of<>1537 -0.0337 1 3528 5668 in<>,<>1538 -0.0349 9 2828 8193 of<>and<>1539 -0.0358 1 5668 3960 ,<>of<>1540 -0.0365 105 8193 5668 of<>.<>1541 -0.0366 3 5668 4219 .<>of<>1542 -0.0370 1 4219 5668 a<>,<>1543 -0.0376 1 2975 8193 and<>,<>1544 -0.0388 27 3960 8193 to<>,<>1545 -0.0401 6 3528 8193 .<>,<>1546 -0.0447 3 4219 8193 of<>,<>1547 -0.0505 14 5668 8193 the<>,<>1548 -0.0639 1 8202 8193 (the end) ************************************** There is a possible cutoff point. Before that, two words of most bigrams don't appear a lot in this text or in English but after that, there are mostly familiar words. It means phi coefficient method is good at distinguishing individual high frequency words from low ones. 1.2.2 tmi.pm Part of results ************************************** 136218 ,<>and<>1 0.0302 1533 8193 3960 of<>the<>2 0.0207 1392 5668 8202 .<>The<>3 0.0202 573 4219 659 .<>It<>4 0.0170 472 4219 508 in<>the<>5 0.0127 783 2828 8202 it<>is<>6 0.0112 414 1667 2624 It<>is<>7 0.0103 288 508 2624 ,<>but<>8 0.0092 352 8193 501 . . . much<>depended<>60 0.0001 1 192 4 reservoir<>of<>60 0.0001 2 2 5668 expensive<>mistake<>60 0.0001 1 6 23 the<>bench<>60 0.0001 3 8202 4 had<>crowded<>60 0.0001 1 236 5 s<>actual<>60 0.0001 1 119 10 it<>certainly<>60 0.0001 3 1667 21 two<>million<>60 0.0001 1 60 7 (possible cutoff point1) born<>into<>60 0.0001 2 37 250 popular<>poet<>60 0.0001 1 48 17 irreclaimable<>givers<>60 0.0001 1 1 1 facades<>;<>60 0.0001 1 1 660 the<>spirits<>60 0.0001 3 8202 3 Petite<>Provence<>60 0.0001 1 1 1 consistent<>ourselves<>60 0.0001 1 5 23 . . . artist<>only<>61 0.0000 1 9 272 that<>flower<>61 0.0000 1 1912 12 or<>dignity<>61 0.0000 1 739 11 reports<>,<>61 0.0000 3 15 8193 soon<>have<>61 0.0000 1 18 697 world<>if<>61 0.0000 1 253 354 fruit<>,<>61 0.0000 2 6 8193 nature<>into<>61 0.0000 1 118 250 ,<>harmony<>61 0.0000 1 8193 11 (possible cutoff point2) ,<>not<>62 -0.0001 56 8193 1076 as<>of<>62 -0.0001 10 976 5668 life<>the<>62 -0.0001 5 360 8202 .<>of<>62 -0.0001 1 4219 5668 for<>that<>62 -0.0001 5 939 1912 is<>for<>62 -0.0001 9 2624 939 .<>,<>62 -0.0001 3 4219 8193 for<>,<>62 -0.0001 6 939 8193 has<>,<>62 -0.0001 3 518 8193 if<>,<>62 -0.0001 2 354 8193 (the end) ************************************ Possible cutoff point 1 Before that, bigrams have some sort of strong dependence. After that, there are some low appearing words and don't show much dependence. Possible cutoff point 2 Before that, one word in bigram appears a lot but another one appears much less times. After that, both words are high frequency words. They are both pretty much independent. This tells that phi coefficient method is better at distinguishing highly used words from seldom used words and tmi method more emphasizes on dependence and independence. 1.3 RANK COMPARISON Run rank.pl cwces10.ll cwces10.tmi we can get Rank correlation coefficient = 0.4412 This means log likelihood method and tmi have some match although not perfect match. Run rank.pl cwces10.ll cwces10.user2 we can get Rank correlation coefficient = 0.9182 This means phi coefficient method has a very good match with log likelihood method. Run rank.pl cwces10.mi cwces10.tmi we can get Rank correlation coefficient = 0.3310 This means mi method and tmi method have some degree of match but not much. Run rank.pl cwces10.mi cwces10.user2 we can get Rank correlation coefficient = 0.9629 This means mi method and phi coefficient method have almost perfect match! 1.4 OVERALL RECOMMENDATION Phi coefficient method is a good measure of association for 2 by 2 tables. From rank program, we can see Phi coefficient method and mi method and log likelihood method have great match and all emphasize on distinguishing familiar words from unfamiliar ones and should be better at measuring independence than dependence. "True mutual information" method emphasizes more on dependence-independence issue, and it is better at detecting dependence. So tmi.pl is the best among those 4 for identifying significant collocations in large corpora of text. Therefore according to the results we want from testing, we can choose among them. %%%%%%%%%%%%%% EXPERIMENT 2 2.1 TOP 50 COMPARISON 2.1.1 user3.pm Using user3.pm (namely, phi coefficient method) to run NSP Part of results ****************************************************** 136217 rt<>lsm<>r<>1 369.0745 1 1 1 1 1 1 1 Sauve<>qui<>peut<>1 369.0745 1 1 1 1 1 1 1 Free<>Plain<>Vanilla<>1 369.0745 1 1 1 1 1 1 1 Vanilla<>Electronic<>Texts<>1 369.0745 1 1 1 1 1 1 1 Porte<>du<>Pont<>1 369.0745 1 1 1 1 1 1 1 US<>Internal<>Revenue<>1 369.0745 1 1 1 1 1 1 1 Plain<>Vanilla<>Electronic<>1 369.0745 1 1 1 1 1 1 1 MEDIUM<>IT<>MAY<>1 369.0745 1 1 1 1 1 1 1 demolishing<>Notre<>Dame<>1 369.0745 1 1 1 1 1 1 1 morituri<>te<>salutamus<>1 369.0745 1 1 1 1 1 1 1 maidenly<>screams<>waken<>1 369.0745 1 1 1 1 1 1 1 quid<>pro<>quo<>1 369.0745 1 1 1 1 1 1 1 Revenue<>Service<>IRS<>1 369.0745 1 1 1 1 1 1 1 debased<>Austrian<>coin<>1 369.0745 1 1 1 1 1 1 1 mensa<>et<>thoro<>1 369.0745 1 1 1 1 1 1 1 Internal<>Revenue<>Service<>1 369.0745 1 1 1 1 1 1 1 Parc<>aux<>Cerfs<>1 369.0745 1 1 1 1 1 1 1 25<>Short<>Studies<>1 369.0745 1 1 1 1 1 1 1 WHOM<>SHAKESPEARE<>WROTE<>1 369.0745 1 1 1 1 1 1 1 b<>rt<>lsm<>1 369.0745 1 1 1 1 1 1 1 troublesome<>fly<>sheet<>1 369.0745 1 1 1 1 1 1 1 du<>Pont<>Tournant<>1 369.0745 1 1 1 1 1 1 1 foul<>oaths<>tore<>1 369.0745 1 1 1 1 1 1 1 La<>Petite<>Provence<>1 369.0745 1 1 1 1 1 1 1 UNDER<>STRICT<>LIABILITY<>2 369.0738 1 2 1 1 1 1 1 . . . the<>higher<>would<>7 369.0732 1 8202 68 405 36 33 1 may<>be<>refined<>7 369.0732 1 301 1183 6 114 1 1 inevitable<>remedy<>is<>7 369.0732 1 6 13 2624 1 1 1 other<>introduction<>than<>7 369.0732 1 207 4 334 1 5 1 sound<>.<>In<>7 369.0732 1 12 4219 163 2 1 143 says<>that<>this<>7 369.0732 1 16 1912 743 3 1 37 services<>announced<>,<>7 369.0732 1 4 1 8193 1 1 1 for<>dancing<>,<>7 369.0732 1 939 2 8193 1 106 1 meteors<>were<>expected<>7 369.0732 1 3 236 29 1 1 2 Bench<>,<>and<>7 369.0732 1 3 8193 3960 1 1 1533 an<>incident<>,<>7 369.0732 2 507 7 8193 3 38 2 to<>the<>entrance<>7 369.0732 1 3528 8202 4 523 1 3 a<>charm<>sometimes<>7 369.0732 1 2975 27 33 3 1 1 entirely<>naturalized<>?<>7 369.0732 1 10 6 397 1 1 2 Upon<>the<>theory<>7 369.0732 1 5 8202 22 3 1 7 at<>rest<>,<>7 369.0732 1 377 35 8193 2 42 1 the<>American<>mind<>7 369.0732 1 8202 134 121 62 19 1 (the end) ******************************************************** From the result, we can see phi coefficient method is not very good at measuring 3-way tables. It is not sensitive enough since that is chi-square statistic divided by total numbers of trigrams, then take square root. So the phi values are small. From this we can also see Pearson's chi-square method is better than phi coefficient method on 3-way tables since the value is much bigger so much more sensitive on ranking. For phi coefficient method we only ranked to No. 7. Top ranks are similar to experiment 1. They all appear only once in the text and most are infrequency appearing words so it doesn't really mean they are dependent. the bottom rows show independence very well. 2.1.2 ll3.pm Using ll3.pm (namely, log likelihood method) to run NSP Part of results ********************************************* 136217 the<>world<>of<>1 -3215299.9030 1 8202 253 5668 169 1988 10 the<>sort<>of<>2 -3215537.4462 5 8202 104 5668 8 1988 86 the<>most<>of<>3 -3215540.6047 4 8202 229 5668 126 1988 24 the<>public<>of<>4 -3215791.9803 1 8202 115 5668 65 1988 1 .<>It<>is<>5 -3215837.6590 266 4219 508 2624 472 461 288 the<>best<>of<>6 -3215841.6542 3 8202 85 5668 52 1988 4 the<>whole<>of<>7 -3215856.6467 1 8202 54 5668 40 1988 1 the<>part<>of<>8 -3215863.4397 2 8202 49 5668 3 1988 34 the<>highest<>of<>9 -3215865.4608 1 8202 48 5668 37 1988 1 the<>negro<>of<>10 -3215879.2826 1 8202 61 5668 39 1988 1 the<>first<>of<>11 -3215879.9487 2 8202 66 5668 41 1988 3 the<>South<>of<>12 -3215882.2250 1 8202 45 5668 34 1988 1 the<>pursuit<>of<>13 -3215902.5578 16 8202 36 5668 20 1988 25 the<>number<>of<>14 -3215904.5503 10 8202 33 5668 11 1988 26 the<>end<>of<>15 -3215917.5920 15 8202 59 5668 26 1988 26 the<>development<>of<>16 -3215922.1584 19 8202 60 5668 21 1988 29 the<>young<>of<>17 -3215924.7252 1 8202 111 5668 41 1988 1 the<>sense<>of<>18 -3215927.3588 3 8202 47 5668 4 1988 26 the<>production<>of<>19 -3215940.9602 14 8202 24 5668 14 1988 17 the<>kind<>of<>20 -3215948.4560 5 8202 35 5668 6 1988 21 the<>mass<>of<>21 -3215949.1331 9 8202 30 5668 16 1988 16 the<>amount<>of<>22 -3215949.5086 5 8202 19 5668 6 1988 16 the<>matter<>of<>23 -3215949.6262 4 8202 56 5668 11 1988 23 the<>view<>of<>24 -3215950.7291 1 8202 38 5668 2 1988 20 the<>power<>of<>25 -3215950.9725 12 8202 87 5668 22 1988 28 the<>habit<>of<>26 -3215952.1156 4 8202 36 5668 7 1988 20 the<>necessity<>of<>27 -3215952.3455 11 8202 22 5668 12 1988 16 the<>other<>of<>28 -3215953.9700 1 8202 207 5668 45 1988 1 the<>fact<>of<>29 -3215955.5270 1 8202 66 5668 28 1988 3 the<>stage<>of<>30 -3215957.3509 1 8202 26 5668 18 1988 3 the<>point<>of<>31 -3215957.6134 3 8202 30 5668 7 1988 17 the<>imagination<>of<>32 -3215959.7609 2 8202 27 5668 19 1988 2 the<>possession<>of<>33 -3215960.0980 7 8202 19 5668 7 1988 15 the<>writer<>of<>34 -3215960.9647 1 8202 38 5668 21 1988 3 the<>rest<>of<>35 -3215963.5930 14 8202 35 5668 16 1988 14 the<>spirit<>of<>36 -3215966.7353 11 8202 83 5668 17 1988 26 the<>result<>of<>37 -3215969.0320 10 8202 38 5668 17 1988 15 the<>knowledge<>of<>38 -3215970.5332 1 8202 66 5668 2 1988 21 the<>state<>of<>39 -3215974.9168 2 8202 35 5668 5 1988 16 the<>standard<>of<>40 -3215975.4003 7 8202 26 5668 8 1988 15 the<>chief<>of<>41 -3215976.2044 1 8202 22 5668 15 1988 1 the<>effect<>of<>42 -3215977.2412 9 8202 52 5668 19 1988 15 the<>author<>of<>43 -3215977.2424 2 8202 25 5668 16 1988 2 the<>millions<>of<>44 -3215977.8912 2 8202 17 5668 3 1988 12 the<>influence<>of<>45 -3215978.3595 11 8202 36 5668 15 1988 14 the<>line<>of<>46 -3215979.1140 8 8202 33 5668 13 1988 14 the<>conception<>of<>47 -3215980.2239 1 8202 19 5668 1 1988 12 the<>study<>of<>48 -3215980.4541 9 8202 33 5668 9 1988 15 the<>form<>of<>49 -3215980.7319 10 8202 46 5668 10 1988 17 the<>recognition<>of<>50 -3215981.0518 3 8202 12 5668 4 1988 10 . . . thing<>,<>the<>85636 -3220712.5856 1 79 8193 8202 6 8 475 possible<>,<>the<>85637 -3220712.6360 1 42 8193 8202 4 4 475 life<>which<>the<>85638 -3220712.6470 1 360 551 8202 3 23 34 ,<>the<>labor<>85639 -3220712.7652 1 8193 8202 36 475 3 3 ,<>the<>better<>85640 -3220712.7882 1 8193 8202 66 475 3 5 present<>,<>the<>85641 -3220712.9113 1 45 8193 8202 3 3 475 English<>,<>the<>85642 -3220713.4347 1 59 8193 8202 4 5 475 and<>men<>of<>85643 -3220713.6153 1 3960 238 5668 9 173 13 see<>,<>the<>85644 -3220713.9829 1 88 8193 8202 6 7 475 you<>,<>the<>85645 -3220714.5447 1 172 8193 8202 9 10 475 (end) ***************************************************************** One interesting thing here is that those trigrams are almost all "the ... of" and we know "the ... of" is a very high frequent appearing word and it's frequency is 1988 and it does show dependence between those two words. Compare the result with user3, we can see ll3 is much better at measuring dependence especially two words dependence. However, phi coefficient method is better at detecting independence. 2.2 CUTOFF POINT 2.2.1 user3.pm Part of results ************************************************************* . . . CULTURE<>TO<>ME<>6 369.0733 1 1 10 1 1 1 1 IDLENESS<>IS<>THERE<>6 369.0733 2 3 6 3 2 2 3 EVEN<>IF<>YOU<>6 369.0733 2 2 3 7 2 2 3 Country<>Parson<>feed<>6 369.0733 1 4 1 3 1 1 1 cd<>pub<>docs<>6 369.0733 1 2 2 2 1 1 2 SHAKESPEARE<>WROTE<>AS<>6 369.0733 1 1 1 9 1 1 1 (possible cutoff point) in<>slumber<>?<>7 369.0732 1 2828 1 397 1 6 1 one<>side<>or<>7 369.0732 1 437 23 739 4 8 1 night<>,<>with<>7 369.0732 1 44 8193 682 6 1 107 the<>stranger<>,<>7 369.0732 1 8202 4 8193 2 774 1 any<>competent<>pharmacist<>7 369.0732 1 325 9 1 1 1 1 In<>a<>piece<>7 369.0732 1 163 2975 11 13 1 6 should<>miss<>for<>7 369.0732 1 198 2 939 1 1 1 to<>uncover<>it<>7 369.0732 1 3528 1 1667 1 79 1 wide<>awake<>citizen<>7 369.0732 1 19 3 18 1 1 1 The<>mention<>of<>7 369.0732 1 660 2 5668 1 119 1 certain<>duties<>in<>7 369.0732 1 89 19 2828 3 5 1 . . . ************************************************************ Before that point, all three words of trigrams don't appear a lot but after that point, at least one word or a 2-word association has high frequency. It also means phi coefficient method is good at distinguishing between individual high and low frequency words. 2.2.2 ll3.pm Part of results ************************************** . . . f<>wheels<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>Newport<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>Venus<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>egalite<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>cooking<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>mines<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>expansion<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>Maine<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 of<>repartee<>and<>28664 -3220648.0608 1 5668 2 3960 1 267 1 (possible cutoff point) I<>wondered<>if<>28665 -3220648.0652 3 389 8 354 4 5 3 long<>as<>we<>28666 -3220648.0656 1 74 976 652 4 1 28 we<>call<>progress<>28667 -3220648.0662 1 652 35 24 9 1 1 form<>of<>conversations<>28668 -3220648.0681 1 46 5668 3 17 1 2 has<>given<>us<>28669 -3220648.0702 2 518 33 168 6 3 4 women<>,<>that<>28670 -3220648.0731 1 196 8193 1912 20 2 204 to<>him<>to<>28671 -3220648.0780 1 3528 163 3528 22 67 16 the<>Tuileries<>was<>28672 -3220648.0812 2 8202 12 586 11 44 2 when<>she<>wore<>28673 -3220648.0824 1 231 169 6 9 1 2 learning<>they<>would<>28674 -3220648.0859 1 14 515 405 1 1 18 a<>month<>,<>28675 -3220648.0966 1 2975 23 8193 9 252 2 It<>may<>actually<>28676 -3220648.0996 1 508 301 12 16 1 1 the<>country<>except<>28677 -3220648.1016 1 8202 96 39 30 6 1 exciting<>accumulation<>of<>28678 -3220648.1103 1 4 12 5668 1 3 9 to<>making<>a<>28679 -3220648.1114 1 3528 36 2975 2 153 3 ?<>He<>has<>28680 -3220648.1191 1 396 112 518 3 2 12 down<>to<>inspect<>28681 -3220648.1270 1 79 3528 2 17 1 2 of<>every<>human<>28682 -3220648.1298 2 5668 100 116 14 15 6 man<>who<>ever<>28683 -3220648.1354 1 203 379 72 13 1 2 eighteenth<>century<>to<>28684 -3220648.1357 1 5 25 3528 4 1 2 was<>of<>more<>28685 -3220648.1433 1 586 5668 409 3 5 2 about<>some<>of<>28686 -3220648.1444 1 211 208 5668 1 2 36 original<>conception<>of<>28687 -3220648.1456 1 17 19 5668 1 4 12 the<>next<>generation<>28688 -3220648.1457 2 8202 25 32 12 5 3 . . . ******************************************************************** Before that point, there is always a 2-word association having high frequency but after that point, no very high frequency for any 2-word association. It also means log likelihood method is good at detecting 2-word association. 2.3 OVERALL RECOMMENDATION Phi coefficient method is not very good for measuring trigrams. It's quite different from Pearson's chi-square method which should be much better on this. Log likelihood method is better at detecting 2-word association in trigrams. I searched on the web and books at math department and also consulted Dr. Kang James about a good method for both 2-way and 3-way measure of association. She suggested log linearmodel and said there were not many measures other than pearson's chi-square, log likelihood and Fisher's method?? Log linearmodel is complicated and I haven't found a good model to implement it so I decided to still use phi coefficient method. Until conducting this experiment, I then realized that is not a very good measure for trigrams although it is good for bigrams... ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ maha0084/ll3.pm0100666016600500004710000000763207551507247011450 0ustar maha0084package ll3; require Exporter; @ISA = qw ( Exporter ); @EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString); # Aniruddha Mahajan - maha0084 # CS8761 Fall 2002 # NLP Assignment 2 # #-------------Log Likelihood Ratio for Trigrams ------------- sub initializeStatistic { ($ngrams, $totaltrigrams, $freqarrindex, $freqarr) = @_; } sub calculateStatistic { @arr = @_; #get array $tott = $totaltrigrams; # xyz x y z xy xz yz order of arguments as received from statistic.pl $xyz = $arr[0]; $x = $arr[1]; $y = $arr[2]; $z = $arr[3]; $xy = $arr[4]; $xz = $arr[5]; $yz = $arr[6]; $xbaryz = $yz - $xyz; # calculate values of variables on cube of 2x2x2 + totals $xbarz = $z - $xz; # 27 values in all .. including totals of all 3 matrices $ybarz = $z - $yz; $xbarybarz = $xbarz - $xbaryz; $xybarz = $xz - $xyz; $xybar = $x - $xy; $xbary = $y - $xy; $xbar = $tott - $x; $xbarybar = $xbar - $xbary; $ybar = $tott - $y; $xyzbar = $xy - $xyz; $yzbar = $y - $yz; $xzbar = $x - $xz; $zbar = $tott - $z; $xybarzbar = $xybar - $xybarz; $xbaryzbar = $yzbar - $xyzbar; $xbarzbar = $zbar - $xzbar; $xbarybarzbar = $xbarzbar - $xbaryzbar; $ybarzbar = $zbar - $yzbar; #--------------------------------- #---Calculating-Expected-Values--- $exyz = ($x*$y*$z)/($tott*$tott); # expected values -> formula derived as shown...... $exbaryz = ($xbar*$y*$z)/($tott*$tott); # from Null Hypothesis ... P(xyz) = P(x)*P(y)*P(z) $exybarz = ($x*$ybar*$z)/($tott*$tott); # thus (exyz/tott) = (x/tott) * (y/tott) * (z/tott) $exbarybarz = ($xbar*$ybar*$z)/($tott*$tott); # exyz = x*y*z/tott*tott $exyzbar = ($x*$y*$zbar)/($tott*$tott); # $exbaryzbar = ($xbar*$y*$zbar)/($tott*$tott); # $exybarzbar = ($x*$ybar*$zbar)/($tott*$tott); $exbarybarzbar = ($xbar*$ybar*$zbar)/($tott*$tott); #--------------------------------- #---Calculate-Log-Likelihood-Ratios--- $llrr = 0; if($exyz!=0 && $xyz!=0) { $llr1 = $xyz/$exyz; # calculate log liklihood ratio of each of the 8 possible combinations of x,y,z $llr1 = log($llr1)/log(2); $llr1 = $xyz * $llr1; $llrr += $llr1; # llrr is summation of all such calculated values i.e. log likelihood ratio } if($exbaryz!=0 && $xbaryz!=0) { $llr2 = $xbaryz/$exbaryz; $llr2 = log($llr2)/log(2); $llr2 = $xbaryz * $llr2; $llrr += $llr2; } if($exybarz!=0 && $xybarz!=0) { $llr3 = $xybarz/$exybarz; $llr3 = log($llr3)/log(2); $llr3 = $xybarz * $llr3; $llrr += $llr3; } if($exbarybarz!=0 && $xbarybarz!=0) { $llr4 = $xbarybarz/$exbarybarz; $llr4 = log($llr4)/log(2); $llr4 = $xbarybarz * $llr4; $llrr += $llr4; } if($exyzbar!=0 && $xyzbar!=0) { $llr5 = $xyzbar/$exyzbar; $llr5 = log($llr5)/log(2); $llr5 = $xyzbar * $llr5; $llrr += $llr5; } if($exbaryzbar!=0 && $xbaryzbar!=0) { $llr6 = $xbaryzbar/$exbaryzbar; $llr6 = log($llr6)/log(2); $llr6 = $xbaryzbar * $llr6; $llrr += $llr6; } if($exybarzbar!=0 && $xybarzbar!=0) { $llr7 = $xybarzbar/$exybarzbar; $llr7 = log($llr7)/log(2); $llr7 = $xybarzbar * $llr7; $llrr += $llr7; } if($exbarybarzbar!=0 && $xbarybarzbar!=0) { $llr8 = $xbarybarzbar/$exbarybarzbar; $llr8 = log($llr8)/log(2); $llr8 = $xbarybarzbar * $llr8; $llrr += $llr8; } $llrr = 2 * $llrr; return($llrr); } maha0084/tmi.pm0100666016600500004710000000537007551470721011540 0ustar maha0084package tmi; require Exporter; @ISA = qw ( Exporter ); @EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString); # Aniruddha Mahajan # CS8761 Fall 2002 # NLP Assignment 2 # #---------------------------True Mutual Information---------Bigrams----------------- sub initializeStatistic { ($ngrams, $totalbigrams, $freqarrindex, $freqarr) = @_; #initialize variables as passed by statistic.pl } sub calculateStatistic { @arr = @_; #get array $totb = $totalbigrams; #----------- $xy = $arr[0]; # no. of times xy bigram occurs $x = $arr[1]; # no. of times x occurs $y = $arr[2]; # no. of times y occurs $xybar = $x - $xy; # calculate rest of constituents of matrix $xbary = $y - $xy; # alongwith totals $ybar = $totb - $y; $xbar = $totb - $x; $xbarybar = $ybar - $xybar; #----------- # now calculate P(m,n)*log[P(m,n)/P(m)*P(n)] # for all constituents of the matrix $Pxy = $xy/$totb; # P(x,y) $Px = $x/$totb; # P(x) $Py = $y/$totb; # P(y) if($Px!=0 && $Py!=0) { $f1 = $Pxy/$Px; $f1 = $f1/$Py; if($f1!=0) { $f1 = log($f1)/log(2); # Pointwise MI } $f1 = $Pxy * $f1; # for True Mutual Info $ff = $f1; } #----------- $Pxybar = $xybar/$totb; $Pybar = $ybar/$totb; if($Px!=0 && $Pybar!=0) { $f2 = $Pxybar/$Px; $f2 = $f2/$Pybar; if($f2!=0) { $f2 = log($f2)/log(2); } $f2 = $Pxybar * $f2; $ff += $f2; } #----------- $Pxbary = $xbary/$totb; $Pxbar = $xbar/$totb; if($Pxbar!=0 && $Py!=0) { $f3 = $Pxbary/$Pxbar; $f3 = $f3/$Py; if($f3!=0) { $f3 = log($f3)/log(2); } $f3 = $Pxbary * $f3; $ff += $f3; } #----------- $Pxbarybar = $xbarybar/$totb; if($Pxbar!=0 && $Pybar!=0) { $f4 = $Pxbarybar/$Pxbar; $f4 = $f4/$Pybar; if($f4!=0) { $f4 = log($f4)/log(2); } $f4 = $Pxbarybar * $f4; $ff += $f4; } #----------- $mi = $ff; # True Mutual Information return($mi); } maha0084/user2.pm0100666016600500004710000000424307551573606012013 0ustar maha0084package user2; require Exporter; @ISA = qw ( Exporter ); @EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString); # Aniruddha Mahajan - maha0084 # CS8761 Fall 2002 # NLP Assignment 2 #-----------------------User2 -- Russell & Rao's Coefficient ---------------------- sub initializeStatistic { ($ngrams, $totalbigrams, $freqarrindex, $freqarr) = @_; #initialize variables as passed by statistic.pl } sub calculateStatistic { @arr = @_; #get array $totb = $totalbigrams; #----------- $xy = $arr[0]; # no. of times xy bigram occurs $x = $arr[1]; # no. of times x occurs $y = $arr[2]; # no. of times y occurs $xybar = $x - $xy; $xbary = $y - $xy; $ybar = $totb - $y; $xbar = $totb - $x; $xbarybar = $ybar - $xybar; #----------- # Russell & Rao's Coefficient #----------- $R = ($xy)/($xy+$xybar+$xbary+$xbarybar); # Russell & Rao's Coefficient return($R); } #Russell & Rao's Coefficient = a/(a+b+c+d) # i.e. R = 2(xy)/(xy+xbary+xybar+xbarybar) # # To illustrate difference between Dice and Russell & Rao's coefficient... # Conjoint absence is taken into consideration in the denominator whereas it is not so in Dice # # I compared the 2 to each other and also to leftFisher # Here are the results --- # Rank dice user2 -0.88065 --> Dice and Russle-Rao's tests are different # Rank dice leftFisher # Rank user2 leftFisher # # user2 i.e.Russell & Rao's Coefficient produces good results with respect to collocation hunting. # # One reference to this coefficient is found at -- # http://www.soziologie.wiso.uni-erlangen.de/koeln/script/chap3.doc # # I have also implemented Jaccard's Test. # Jaccard's coefficients is different from dice in the fact that in dice conjoint presence is # given double weightage .. whereas in Jaccard's coefficient calculation it is not so. # It is similar to Dice coefficient # # maha0084/user3.pm0100666016600500004710000000264607551573555012024 0ustar maha0084package user3; require Exporter; @ISA = qw ( Exporter ); @EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString); sub initializeStatistic { ($ngrams, $totaltrigrams, $freqarrindex, $freqarr) = @_; } sub calculateStatistic { @arr = @_; #get array $tott = $totaltrigrams; # xyz x y z xy xz yz $xyz = $arr[0]; $x = $arr[1]; $y = $arr[2]; $z = $arr[3]; $xy = $arr[4]; $xz = $arr[5]; $yz = $arr[6]; $xbaryz = $yz - $xyz; $xbarz = $z - $xz; $ybarz = $z - $yz; $xbarybarz = $xbarz - $xbaryz; $xybarz = $xz - $xyz; $xybar = $x - $xy; $xbary = $y - $xy; $xbar = $tott - $x; $xbarybar = $xbar - $xbary; $ybar = $tott - $y; $xyzbar = $xy - $xyz; $yzbar = $y - $yz; $xzbar = $x - $xz; $zbar = $tott - $z; $xybarzbar = $xybar - $xybarz; $xbaryzbar = $yzbar - $xyzbar; $xbarzbar = $zbar - $xzbar; $xbarybarzbar = $xbarzbar - $xbaryzbar; $ybarzbar = $zbar - $yzbar; #------------------------- $R = ($xyz)/ ( ($xyz+$xbarybar+$xbaryz+$xyzbar+$xybarz+$xbaryzbar+$xbarybarz+$xybarzbar)); # Russell & Rao's Coefficient return($R); } # Russell & Rao's Coefficient is extended for trigrams easily due to linear relation in the calculation of the constituents which contribute towards # the coefficient. # # # # # From maha0084@d.umn.edu Sun Oct 20 20:48:25 2002 Received: from gerard.d.umn.edu (gerard.d.umn.edu [131.212.41.24]) by mail.d.umn.edu (8.11.6/8.11.6) with ESMTP id g9L1lu829215 for ; Sun, 20 Oct 2002 20:47:56 -0500 (CDT) Received: from localhost (maha0084@localhost) by gerard.d.umn.edu (8.11.6+Sun/8.11.6) with ESMTP id g9L1ltU01253 for ; Sun, 20 Oct 2002 20:47:55 -0500 (CDT) Date: Sun, 20 Oct 2002 20:47:55 -0500 (CDT) From: Aniruddha S Mahajan To: ted pedersen Subject: Re: experiments.txt In-Reply-To: <200210202055.g9KKt9R28080@csdev04.d.umn.edu> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO Hi Ted ... Thanks for letting me know ... I am at a complete loss to explain how it happened. I have several backup files and i am pasting my experiments.txt file here. I owe you one... Thanks, Andy # Aniruddha Mahajan # CS8761 Fall2002 # Assignment 2 # experiments.txt # Introduction: We have to identify 2 or 3 word collocations from a large body of text. A collocation is a sequence of words that can be viewed as syntactic or semantic units. Identification of proper collocations is important as we need collocations which are semantically right. For example ",<>and" is a bigram and ",<>and<>the" is a trig ram, but none among these can be counted as a collocation as they have no 'proper' me aning in the English language. The meaning of a collocation is not completely defined by simply co-joining the meanings of it's constituents. Thus we want to identify collocat ions from the list of bigrams/trigrams. It can be said that the test or coefficient used s hould assign higher scores to such bigrams/trigrams. A good test should throw up collocations or words which occur t ogether (much?) more frequently than separately. A corpus of 1.1 million words was used for all tests. This was t aken from the Brown Corpora. The NSP package developed by Ted Pedersen and Satanjeev Baner jee was used to implement all tests on the said corpus. **************************** Experiment 1 **************************** TRUE MUTUAL INFORMATION ^^^^^^^^^^^^^^^^^^^^^^^ To implement "True Mutual Information" for 2 word sequences. True Mutual Informa tion is calculated using the formula -- TMI = Summation[ P(XiYi) * log(P(XiYi)/P(X)P(Y)) ] where log is to the base 2 and X and Y take the values {x,xbar} and {y,ybar} res pectively. Upon implementing tmi and executing NSP we get a list of bigrams sorted by rank. These bigrams reveal some collocations amongst a large number of bigrams which cannot be classified as collocations. Some collocations like "New York" "U S" (because of the periods U.S.) "United States" "years ago" "World War" "Los Angeles" "fiscal year" "Gener al Motors" "Civil War" "Puerto Rico" "anti Semitism" etcetra appear in the top 50 ranks of this file. Many bigrams like ". The" "of the" "in the" ", but" "it is" ". He" do not offer much useful information to us in this regard. However we do learn that tmi gives us top occu ring bigrams and not collocations. Method of execution and output ... perl statistic.pl tmi.pm tmiop2.txt ctop2.txt gives the output....(corpus of 1.1 Million words) .<>The<>1 0.0142 4366 40218 6134 of<>the<>2 0.0109 8429 32697 54968 ,<>and<>5 0.0047 4593 49461 23745 United<>States<>8 0.0033 341 450 441 New<>York<>15 0.0025 256 522 28 number<>of<>27 0.0012 331 452 32697 years<>ago<>31 0.0008 103 845 204 of<>his<>33 0.0006 616 32697 4901 Peace<>Corps<>34 0.0005 46 60 77 USER2 (Russel & Rao's Coefficient) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To calculate Russel & Rao's coefficient for bigrams. R = a / (a + b + c + d) where R -> Russel & Rao's coefficient a -> xy b -> xybar c -> xbary d -> xbarybar Although the formula is similar to that os Dice coefficient... the two are very different. Executing rank.pl on dice and rao gives a correlation coefficient of -0.8805 whi ch means that the two coefficients are quite opposite. Conjoint absence is taken into co nsideration in the denominator whereas it is not so in Dice. Russell-Rao coefficient does no t produce many collocations in the very top ranks, but they do appear in the top 50. "numb er of" "United States" "Chinese Puzzle" "ultramarine blue" "straight hair""Constable He nry" are some. However the Russell-Rao coeffecient produced a fair number of collocations per a set of trigrams in the lower ranks. The Russell-Rao coefficient is always less than 1. (except when all the corpus consists of is the only bigram). perl statistic.pl user2.pm user2op2.txt ctop2.txt gives output like -- (corpus of 1.1 Million words) of<>the<>1 0.0074 8429 32697 54968 in<>the<>2 0.0042 4729 17359 54968 he<>had<>18 0.0004 439 4945 3526 number<>of<>19 0.0003 331 452 32697 United<>States<>19 0.0003 341 450 441 psychiatric<>knowledge<>22 0.0000 1 5 132 nuclear<>age<>22 0.0000 1 110 208 poem<>by<>22 0.0000 2 43 4754 EXAMPLE of<>the<>1 0.0074 8429 32697 54968 the | thebar | --------------|----------|--------- of 8429 | 24268 | 32697 | | ofbar 46539 | 1059081 |1105620 --------------|----------|---------- 54968 | 1083349 |1138317 of the --> R = 8429/(8429+24268+46539+1059081) R = 0.0074 Thus calculated and observed value of Russel & Rao's Coeffecint is same. Top 50 Comparison ^^^^^^^^^^^^^^^^^ Both the tests give similar results .. i.e. in output of both there are some bigrams which can be called collocations whereas there are many bigrams which ca nnot. One more similarity is that the Russell-Rao coefficient and tmi both have low va lues (below 1). thus in the whole CORPUS we get only 39 ranks in tmi and 22 in user2. In both the tests the lower ranks contain more collocations. Cutoff Point ^^^^^^^^^^^^ There is no such cut-off point observed in these tests. Both the tests are una ble to pick out the collocations from the bigrams. Though we can say that a cut-off poi nt does exist near the end where lots of collocations are found -- for both tmi and user 2. There does exist a cut-off point for ll.pm - after like 500 ranks collocatio ns begin to appear in the output and they become prominant in the output after 8000 ranks . Comparing ll , mi, tmi, user2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ rank ll tmi -0.9613 rank ll user2 -0.9880 rank mi tmi -0.9613 rank mi user2 -0.9878 rank dice user2 -0.8805 rank tmi user2 0.9999 Overall Recommendation ^^^^^^^^^^^^^^^^^^^^^^ I think (after observing results from tests performed on CORPUS) mi is the b est one for searching for collocations (in a corpus of similar size). ll is also goo d but the very top ranks are dominated by bigrams which cannot be classified as collocatio ns. E.g. ".<>The<>" "of<>the<>" ",<>but<>". ******************************* Experiment 2 ******************************* Log Likelihood for Trigrams ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Log Likelihood was to be implemented for trigrams. p(W1)p(W2)p(w3) = p(W1,W2,W3) is the given Null Hypothesis to implement ll3. Using the same ... P(xyz) = P(x)P(y)P(z) exyz/tott = x/tott * y/tott * z/tott exyz = (x*y*z)/(tott*tott) Where exyz is expected value of xyz. Thus we get formula for expected value in Log Likelihood ratio. Now, Log Likelih ood can be calculated by the following formula ... L = 2* Summation[XYZ log(XYZ/eXYZ)] where.. L = Log Likelihood Ratio XYZ = Observed value of XYZ eXYZ = Estimated value of XYZ --> X,Y,Z take values (x,xbar), (y,ybar), (z,zbar) respectively. --> log is taken to base 2 command used to run NSP on ll3... perl statistic.pl --ngram 3 ll3.pm ll3op3.txt ctop33.txt Results obtained were as follows -- Almost every trigram had a different rank. ll3 showed sound results with some collocations springing up in the top 50. In the later rank (>7000). ,<>.<>The<>1 36937.9893 7 49461 40218 6134 56 30 4366 of<>.<>The<>2 35662.0687 1 32697 40218 6134 18 1 4366 .<>The<>.<>3 34622.8955 1 40218 6134 40219 4366 293 1 Metropolitan<>Rhode<>Island<>52236 2005.5033 1 12 88 124 1 1 76 the<>first<>half<>52539 1958.5829 8 54968 1129 291 452 29 11 USER3 - Russell & Rao's coefficient ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Russell & Rao's coefficient was extended to Trigrams. The formula is linear so it can easily be extended to incorporate more variables. R = xyz / (xyz + xyzbar + xybarz + xbaryz + xybarzbar + xbaryzbar + xbarybarz + xbarybarzbar) One thing is that Russel-Rao's coefficient is always less than one and in thi s case it is pretty small .. thus we get very few ranks as NSP ranks extend only upto 4 decim al places. One solution may be to scale the output by a factor of 1000 or something. The ou tput produces results.. "front page stories" "day by day" "and later patented". Collocations w hich have a definite meaning are harder to get for trigrams in these cases atleast. perl statistic.pl --ngram 3 user3.pm user3op3.txt ctop33.txt Results obtained were .. ,<>and<>the<>1 0.0004 443 49461 23745 54968 4593 2388 1785 .<>It<>is<>1 0.0004 405 40218 1759 9559 1378 1029 526 the<>number<>of<>4 0.0001 84 54968 452 32697 106 7513 331 parked<>in<>front<>5 0.0000 3 28 17359 164 4 3 68 Day<>by<>day<>5 0.0000 1 52 4754 575 1 3 7 brief<>glimpses<>of<>5 0.0000 1 60 4 32697 1 6 3 Example ,<>and<>the<>1 0.0004 443 49461 23745 54968 4593 2388 1785 R = (443)/(443+1342+51238+1945+4150+42923+17810+1018466) xyz xbaryz xbarybarz xybarz xyzbar xybarzbar xbaryzbar xbarybarzba r R = 443 / 1138317 R = 0.00038 ~ 0.004 Thus Estimated value equals observed value Top 50 Comparison ^^^^^^^^^^^^^^^^^ Both the tests (ll3 and user3) start off with giving highest ranks to trigra ms which occur most frequently. Trigrams with punctuation marks, articles like 'the' and 'a' also the word 'and' occurs most frequently in the top ranks. After that however around the 5th rank (user 3 has limited ranks .. b ut still it is quite early) user3 i.e. Russell & Rao's coefficient shows some collocations and trigrams which are not exactly collocations but are 'in betweens'. ll3 continues to be partial towards articles, "and" etc. ll3 shows decent results prety late - but then it has got a different rank fo r almost every trigram. In the top 50 (with scaling for user3) user3 gives more collocations and reasonable wor d-sequences than ll3. Cutoff Point ^^^^^^^^^^^^ Again, without scaling user3 has a cutoff point of 5 ... which is around 800 0 trigrams out of a million. After this point it begins to show meaningful word sequences with collocations s cattered around. I did not find any order in the occurance of collocations as such.. but then again 3 word sequences which could be called collocations were themselves very less. ll3 also exhibits cut-off but gradually over ranks in the range of 10000. Rank Comparison ^^^^^^^^^^^^^^^ I am using this comparison to show that user3 retained its qualities after e xtension for trigrams. Also to show that Russell-Rao's and dice coefficients are different in nature. Rank user2 dice -0.8805 --> user2 & dice are different in natu re Rank ll user2 -0.9994 Rank ll3 user3 -0.9880 --> user3 retained correlation coeffic ient thus implementation is valid Overall Recommendation ^^^^^^^^^^^^^^^^^^^^^^ It is difficult to pick out one of the two tests. Neither of them is very goo d at selecting collocations from the corpus. Still user3 has a definite cut-off point, while ll3 has a more gradual change o f trigram quality. I would pick Russell & Rao's coefficient because I feel that it gives better collocations even if they are low ranked. --------------------------------------------------------------------------- On Sun, 20 Oct 2002, ted pedersen wrote: > > Hi Andy, > > Your experiments.txt file seems to be a tar file that consists of your > .pm files. Do you have a copy of your report somewhere? If you do, go > ahead and send it to me as plain text via email. Let me know if you > don't... > > Thanks, > Ted > > -- > # Ted Pedersen 726-8770 # > # tpederse@umn.edu http://www.umn.edu/~tpederse # > ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ Typed and Submitted by: Ashutosh Nagle Assignment 2: The Great Collocation Hunt Course:CS8761 Course Name: NLP Date: Oct 11, 2002. +-----------------------------------------------------------+ | Introduction | +-----------------------------------------------------------+ In this assignment, I am interested in the finding and implementing the tests of association for bigrams and trigrams.I have implemented 4 such tests and found out some interesting relation among some of these. ######################################################## # # # http://www.cs.brown.edu/~dpb/papers/dpb-colloc01.pdf # # # ######################################################## Given above is the url where I found this test. The test given here is applicable to all the ngrams. I have used this for both bigrams and trigrams. This test is based on "Log-linear model of n-way interaction" (as stated in section 2.3 of the above document). The above file gives 2 different measures of association namely, "unigram subtuples" and "all subtuples". I have used the "all subtuples" in my implementation. The details of how these measures are derived are available on the web. Given below are the details about how I calculate these measures based on the information in the above pdf file. Let the trigram be A1<>A2<>A3, in general. For a particular trigram a1<>a2<>a3, A1=a1, A2=a2 and A3=a3. A1, A2 and A3 are the random variables. The above said test considers all possible combinations of equalities and inqualities like this (i.eAi=ai or Ai != ai). It represents every such combination by a 3 bit integer, with a 1 for every equality and a 0 every inquality. For example for the above trigram, the test makes 8 such combinations like 1) (A1!=a1)&&(A2!=a2)&&(A3!=a3), 2) (A1!=a1)&&(A2!=a2)&&(A3=a3), 3) (A1!=a1)&&(A2=a2)&&(A3!=a3), ..........and so on. The 3-bit inter corresponding to the first combination above will be 000 = 0, second 001 = 1, third 010 = 2 (binary to decimal conversion), and so on. The next variable that the measure defines is Cb, which is the number of times the combination represented by a particular b occured in the input text. For example C5 would mean the number of times the combination (A1=a1)&&(A2!=a2)&&(A3=a3) occured in the input text because 5 represents the bit pattern 101. The point that I noted about this measure is it makes even the inequalities explicit. By that I mean when C5 is calculated, attention is paid not only to the occurance of A1=a1 and A3=a3, but also to the absence of A2=a2 i.e. A2!=a2. In NSP, the values that we get, include all the possibilities i.e. we get the number of trigrams in which (A1=a1 and A3=a3), but in these trigrams A2 may or may not be 'a2'. Because of this the value of C5 is computed as C5 = N(A1=a1 && A2=a2 && A3=a3) - N(A1=a1 && A3=a3) N represents the number of time the tuple inside the parentheses occured. Both the values are available from count.pl. Then the model computes lambda as Upperlimit UL = (2 raised to power n) - 1, where n is 3 for trigrams NoO = Number of 1's in a particular 'b'. Lambda = Summation | b=0 to UL { [(-1) raised to power (n-NoO)] * [log (cb)]} Please take a look at the formula on the web site. The last variable that the model defines before calculating the measure is sigma, which is equal to sigma = Square root of {Summation | b=0 to UL (1/cb) } Finally, the measure Mu is calculated as Mu=Lambda = 3.29 * sigma. ______________________________ Sample calculation for user3. ------------------------------ Let's take an example of the following values being passed from count.pl's output by statistic.pl. (2,3,2,2,2,2,2) = (c7, c4, c2, c1, c6, c5, c3) Say total number of bigrams in the file = 10. Here, the various values are Left most parentheses use the convention used in NSP. (0,1,2)=c111 = c7 ---- Because all the words occur. [2] (0)= c100 = c4 --------- But this contains values c4 + c5 + c6 + c7 as explained above [3] (1)= c010 = c2 --------- But this contains values c2 + c3 + c6 + c7 as explained above [2] (2)= c001 = c1 --------- But this contains values c1 + c3 + c5 + c7 as explained above [2] (0,1)= c110 = c6 ------- But this contains values c6 + c7 as explained above [2] (0,2)= c101 = c5 ------- But this contains values c5 + c7 as explained above [2] (1,2)= c011 = c3 ------- But this contains values c3 + c7 as explained above [2] let c0 = Total number of trigrams Purify the values by removing the additional factors c3 = c3 - c7 = 0; c5 = c5 - c7 = 0; c6 = c6 - c7 = 0; c1 = c1 - (c3 + c5 + c7) = 0; c2 = c3 - (c3 + c6 + c7) = 0; c4 = c4 - (c5 + c6 + c7) = 1; Now let's add the continuity correction factor 0.5 to all the c's Therefore, c1 = c2 = c3 = c5 = c6 = 0.5 c4 = 1.5 c7 = 2.5 Now let's calculate the number of i's in the binary representation of integers between 0 and 7. b0 = 000 = 0 b1 = 001 = 1 b2 = 010 = 1 b3 = 011 = 2 b4 = 100 = 1 b5 = 101 = 2 b6 = 110 = 2 b7 = 111 = 3 now let's make bi = n - bi, where n is 3 for trigrams b0 = 3 b1 = 2 b2 = 2 b3 = 1 b4 = 2 b5 = 1 b6 = 1 b7 = 0 Lambda = 0 sigma = 0 for i moving from 0 to 7 do { multiplier = (-1) to the power bi; term = log (ci) term = term * multiplier lambda = lambda + term sigma = sigma + sqrt(1/ci) } Using the above algorithm the values of term * multiplier for the various ci's are c0 --> -2.0149 c1 --> -0.6931 c2 --> -0.6931 c3 --> 0.6931 c4 --> 0.4054 c5 --> 0.6931 c6 --> 0.6931 c7 --> 0.9162 Adding the above values gives us lambda = 0 Similarly 1/ci values are c0 --> 0.1333 c1 --> 2 c2 --> 2 c3 --> 2 c4 --> 0.6666 c5 --> 2 c6 --> 2 c7 --> 0.4 Adding the above values gives us sigma = 11.2 Finally the measure Mu = lambda - 3.29 * sigma = 0 - 3.29 * 11.2 = -36.848 The same formula is applicable to bigrams. The only difference is that we have 4 c's and 4 b's instead of 8. ______________________________ Sample calculation for user2. ------------------------------ The example is for the values (2, 3, 2) = (x1, x2, x3) returned from count.pl and total #(bigramgs) = 11 Purifying the values c3 = x1 = 2 c2 = x2 - x1 = 3 - 2 = 1 c1 = x3 - x1 = 2 - 2 = 0 c0 = total - (c1 + c2 + c3) = 11 - (2 + 1 + 0) = 8 After addign the continuity correction factor 0.5 co = 8.5 c1 = 0.5 c2 = 1.5 c3 = 2.5 Then,(bi is the number of one's in the binary representation of i) b0 = 00 => 0 b1 = 01 => 1 b2 = 10 => 1 b3 = 11 => 2 n = 2 for bigrams n - b0 = 2 n - b1 = 1 n - b2 = 1 n - b3 = 0 the values of term * multiplier for all ci's are c0 --> 2.1400 c1 --> -0.6931 c2 --> 0.4054 c3 --> 0.9162 The sum of these values gives us lambda = 3.3440 For Mu: 1/c0 =0.1176 1/c1 = 2 1/c2 = 0.6667 1/c3 = 0.4 Sum of these values gives us sigma = 3.1843 Finally thea measure Mu = lambda - 3.29 sigma = 3.3440 - 3.29 * 3.1843 = -7.1323 +-----------------------------------------------------------+ | Experiment 1: bigrams | +-----------------------------------------------------------+ The coupus used is BTL.TXT from BROWNTAG TOP 50 COMPARISON: Looking at TOP 50 ranks in the files generated by user2 and tmi, I feel that user2 is much better in identifying the significant or interesting collocations. I have placed both these files in /home/cs/nagl0033/CS8761 directory. From the formula of tmi, I feel if p(x,y) is too high, it has a major impact on the rank of the bigram. Even when x and y occur only together say n times, the ratio of p(x,y) to [p(x)*p(y)] is 1/n. At all other times it has to be less than this. Depending on n we get some negative value (large or small) of log. We multiply this with p(x,y) and sum for all the bigrams. So I really do not see any correlation between this and the strength of the association between x and y. On the other hand, as per the information given in the above mentioned pdf file, the bigrams that score poorly on the above test are name like bigrams e.g. Nitin Sharma. The words in such bigrams occur only in the particular names. So if the first word occur we know what the next word is going to be. On the other hand, words that occur together as well as separately, quite often score well in the above test. It certainly makes sense that we are likely to be interested in such combinations, rather than the the ones whose words do not occur outside the combination. CUT OFF: I do not see any cut off in the output files. But I realise that bigrams given by tmi are initially totally meaningless, then some sensible ones again meaningless. In the output file of user2, I feel that the bigrams are mostly meaningful and they become meaningless deep down the file. RANK COMPARISONS: _______________________________________ | | user2 | tmi | --------------------------------------- |tmi | -0.8685 | | |mi | 0.8076 | -0.8688 | |dice | 0.8556 | -0.6971 | |x2 | 0.8486 | -0.8681 | |ll | 0.7277 | -0.8676 | |______ |______________ |_______________| Thus from these values, user2 is closer to all the tests, but tmi is not. It gives negative values in all the comparisons. This strengthens my belief that tmi should not be a measure of association. I made this belief after seeing the unrelated and meaningless bigrams in the output file of tmi. OVERALL RECOMMENDATIONS: From the experiment I conclude that tmi is not a good measure of association. Throught the experiment, tmi has always proved this. e.g. through its output file, through the negative reluts nearing -1 in the output of rank.pl. Regarding the remaining ones, I do not feel that some measure is much better than others because they all have given the values in the same range. Also there output files are not observed to have any outstanding feature. +-----------------------------------------------------------+ | Experiment 2: trigrams | +-----------------------------------------------------------+ TOP 50 COMPARISON: Looking at TOP 50 ranks in the files generated by user3 and ll3, I feel that user3 is much better in identifying the significant or interesting collocations. I have placed both these files in /home/cs/nagl0033/CS8761 directory. CUT OFF: In file generated by ll3, I get all useless trigrams near the begining of the file. e.g. a sample trigram with all its values is .<>SD11<>:<>81 68480.7650 11 17965 11 9813 11 9002 11 Most of the trigrams upto rank 84 (score 68393) are like this. But from here onwards I start seeing some meaningful words. With user3 I get meaningful values till rank 1265. RANK COMPARISONS: _______________________ | | user3 | ----------------------- |ll3 | 0.1315 | |______ |______________ | For this value I say, the two measures user3 and ll3 and not quite similar. Though they are not exactly opposite, they are not similar either. OVERALL RECOMMENDATIONS: From the experiment I conclude that user3 is a good measure of associaton as indicated by its output file. ll3 should also be good but may not be as good as user3. This conclusion is purely based on the output files. I understand that there could very well be some counter expample(s). Computation. In ll3 a value n-i,j,k has the same meaning except it is for trigrams. For the count.pl we get the following 8 values All the trigrams = n+++ f(0,1,2), f(0), f(1), f(2), f(0,1), f(0,2), f(1,2) respectively mean n000, n0++, n+0+, n++0, n00+, n0+0, n+00 from these I calculate the remaining 19 (27 - 8) values as n1++ = n+++ - n0++ n+1+ = n+++ - n+0+ n++1 = n+++ - n++0 These were the values with two ++. Similary for just 1 + n+01 = n+0+ - n+00 n+10 = n++0 - n+00 n+11 = n++1 - n+01 in the similar way n0+1,n1+0, n1+1, n01+, n10+, n11+ are computed. Now, n011 = n00+ - n000; Take any position containing a 1 in the bit stream (here position say 3 in 011). Change it to a '+' to get the first term in the right hand side and change it to a '0' to get the second term in the right hand side. Care is taken that the values are computed in such an order that they have all the required values computed before them. So any position with a 1 can be selected. Similarly, n001, n010, n100, n101, n110, n111 are compued. Once this is done eijk = (ni++)*(n+j+)*(n++k)/[(n+++)*(n+++)] This formula is derived in 2 steps from the null htpothesis given. P(x,y,z) = P(x)*P(y)*P(z); Once all the eijk and nijk are computed they are substituted in the formula G square = 2 Summation over all i,j,k (nijk * log (nijk/eijk)); Conclusion: I conclude from the above experiments, that tmi is not a good measure of association, but user2, user3 and ll3 are quite acceptable ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ CS 8761 NATURAL LANGUAUGE PROCESSING FALL 2002 ANUSHRI PARSEKAR pars0110@d.umn.edu ASSIGNMENT 2 : THE GREAT COLLOCATION HUNT Introduction ------------- The objective of this assignment is to find, implement and compare various measures of association for bigrams and trigrams, which can be used to identify collocations, in a large corpus. Collocations are sequences of words whose meaning is not same as the meaning obtained by joining the meanings of its constituent words. Various methods can be used to find out whether a sequence of words isa collocation or not.Comparing frequencies of sequences of words when they appear together and when they appear alone can give us a fair idea about the chances of that sequence is a collocation or not. If the words of a sequence appear alone very frequently as compared to appearing together, then it may not be a collocation. Based on this various measures of association such as pointwise mutual information, logliklihood ratio, dice's coefficeint etc. are used to identify collocations. Input details: All the four modules that were implemented were ran on a corpus obtained by concatenating files taken from the BROWN corpus. The total number of words in the corpus were 1039486. ************************ * Experiment no:1 * ************************ True Mutual Information ----------------------- Mutual information is the amount of information one random variable contains about other. In case of bigrams mutual information can be viewed as the reduction in uncertainity of occurence of a word knowing about the occurence of another word [1]. Hence if two words occur independent of each other their mutual information will be zero and if occurence of two words is highly dependent then their mutual information will be high. Therefore mutual information can be used as a measure of association. Mutual Information = summation over x,y (p(x,y) * log (p(x,y)/(p(x)p(y)) ) User2 - Mutual Expectation -------------------------- This test of association works on textual units(word or character or tag) for an n-gram. Mutual expectation evaluates the degree of cohesiveness that links together the textual units and is based on the concept of normalised expectation [2]. Normalised expectation refers to the expectation of occurence of a word knowing the presence of rest of the words in the n-gram.Intutively,NE is based on conditional probability, however for an n-gram the probability that a given word will occur knowing the rest (n-1) words has to be calculated. Therefore the concept of Fair point expectation is used which refers to the arithmetic mean of the n joint probabilities of the (n-1)-grams contained in the n-gram [2].Hence mutual expectation is calculated as [3] : mutual expectation = joint prob of n words * normalised expectation where, normalised expectation = joint prob of n words / fair point expectation Example: United<>States<>5 0.2275 343 453 443 Fair point expectation = (p(x)+ p(y))/2; = (453 + 443)/2 = 448 Normalised expectation = p(x,y)/FPE = 343/448 =0.765625 Mutual expectation = p(x,y) * NE = 343 / 1154563 * NE = 0.2275 (after scaling by 1000) Top 50 Comparisons -------------------- --------------------------------------------------------------------------------------------- part of output after running user2 part of output after running tmi ---------------------------------------------------------------------------------------------- United<>States<>5 262.6094 343 453 443 United<>States<>8 0.0033 343 453 443 New<>York<>8 162.4188 256 522 285 New<>York<>16 0.0024 256 522 285 number<>of<>197 6.6256 333 455 33018 number<>of<>28 0.0012 333 455 33018 even<>though<>198 6.6092 61 812 314 even<>though<>36 0.0004 61 812 314 --------------------------------------------------------------------------------------------- Comparing the results of tmi.pm and user2.pm it can be seen that both the tests lack to identify actual collocations.Proper nouns are given higher ranks in user2 as compared to tmi. However, user2 does not give high ranks to punctuation marks and prepositions as compared to tmi. The setback for user2 lies in the fact that it gives very high ranks to common phrases such as number of,even though. THerefore tmi seems to be better than user2 for callocation extraction. However there wasn't any cutoff point that could easily divide the output into significant and unsignificant group of words for both tests. Rank Comparison ---------------- The corelation coefficient between true mutual information and and pointwise mutual information was found to be -0.9618. --------------------------------------------------------------------------------------------- part of output after running mi part of output after running tmi ---------------------------------------------------------------------------------------------- Burly<>leathered<>1 20.1389 1 1 1 .<>The<>1 0.0141 4419 41005 6204 Capetown<>coloreds<>1 20.1389 1 1 1 of<>the<>2 0.0108 8506 33018 55619 Bietnar<>Haaek<>1 20.1389 1 1 1 in<>the<>3 0.0062 4764 17511 55619 aber<>keine<>1 20.1389 1 1 1 .<>He<>4 0.0057 1650 41005 2053 drown<>emanations<>1 20.1389 1 1 1 .<>It<>5 0.0047 1400 41005 1791 blandly<>boisterous<>1 20.1389 1 1 1 ,<>and<>5 0.0047 4647 50254 24023 United<>States<>8 0.0033 343 453 443 ----------------------------------------------------------------------------------------------- Pointwise mutual information extracts those pairs of words which may occur a few times individually butalways occur together. On the other hand, true mutual information(TMI) gives high score to those pairs of words which occur together for a large number of times and may occur individuallyfor a large no of times too (without the other word). TMI can identify proper nouns like United States, Rhode Island and terms like per cent, at least which occur together. However TMI gives high ranks to combinations of punctuation, articles and prepositions which are not given high ranks in pointwise mutual information. Hence it can be argued that both these tests of association work in highly different ways. The ranks obtained by tmi and log liklihood are nearly the same except that the scores by logliklihood test are far larger to that obtained by tmi.This is because log liklihood test can be thought about as a reformulation of mutual information. --------------------------------------------------------------------------------------------- part of output after running mi part of output after running user2 ---------------------------------------------------------------------------------------------- GHOST<>TOWN<>1 20.1389 1 1 1 GHOST<>TOWN<>955 1.0000 1 1 1 BOA<>CONSTRICTOR<>1 20.1389 1 1 1 boa<>constrictor<>152 0.0039 1 1 1 MELTING<>POT<>1 20.1389 1 1 1 MELTING<>POT<>182 0.0009 1 1 1 --------------------------------------------------------------------------------------------- Comparing the results of user2 with mi show that mi is more efficient in finding the collocations and collocations may get very high ranks in case of user2. However mi tends to give a single rank to a lot of pairs. The corelation coefficient is 0.6777. --------------------------------------------------------------------------------------------- part of output after running logliklihood part of output after running user2 ---------------------------------------------------------------------------------------------- of<>the<>2 17310.4232 8506 33018 55619 of<>the<>1 1.4140 8506 33018 55619 United<>States<>9 5281.8724 343 453 443 United<>States<>5 0.2275 343 453 443 New<>York<>20 3909.0139 256 522 285 New<>York<>8 0.1407 256 522 285 --------------------------------------------------------------------------------------------- The results of logliklihood are nearly as same as that of user2. User2 gives a somewhat better performance for proper nouns such as New York. Overall Recommendation: ---------------------- After observing all the 4 results, pointwise mutual information test identifies collocations better than any of the rest. Since true mutual information for two dependent variables grows not with the degree of dependence but also according to entropy of the variables [1], it cannot give appropriate representation of the dependence of two words as in a collocation, hence it gives very high ranks to collocations. The user2 (Mutual expectation test) is expected to perform better for n-grams. ********************************* * Experiment No :2 * ********************************* Log Liklihood Ratio ------------------- Log likelihood ratio is performed on a 3 word sequence assuming the following null hypothesis (say words are x,y,z) : p(x,y,z) = p(x) * p(y) * p(z) The expected value e(x,y,z) assuming above hypothesis is given by: e(x,y,z) = ( n(x) * n(y) * n(z)) / (total trigrams * total trigrams) ) where, n(x) ,n(y) ,n(z) = observed values of frequeny of x,y,z respectively The log likelihood ratio LL is calculated as: LL = 2 * summation over x,y,z (n(x,y,z)* log(n(x,y,z) / e (x,y,z) ) User3- Mutual Expectation -------------------------- The concept of mutual expectation can easily be applied to trigrams. As mentioned before,this test of association is based on how well the n-gram accepts the loss of one of it's components [2]. THe formulae are as explained in user2. Top 50 Comparisons ------------------- Looking at the top 50 trigrams given by logliklihood and user3 a number of significant collocations are obtained from user3. Some are listed below. ---------------------------------------------------------------------------------------- part of output after running user3 ---------------------------------------------------------------------------------------- the<>United<>States<>1 0.1877 265 55619 453 443 351 278 343 as<>well<>as<>3 0.0996 199 5983 772 5983 273 547 213 per<>cent<>of<>20 0.0257 51 379 158 33018 137 70 56 Mr<>and<>Mrs<>21 0.0255 40 735 24023 469 44 40 79 the<>White<>House<>22 0.0252 39 55619 113 185 46 52 59 more<>or<>less<>34 0.0187 25 1920 3796 395 26 25 36 one<>of<>the<>36 0.0179 253 2663 33018 55619 471 311 8506 with<>respect<>to<>36 0.0179 37 6017 123 22443 50 95 54 ---------------------------------------------------------------------------------------- On the other hand, loglikelihood gives very high ranks to trigrams with punctuation and does not help significantly in collocation extraction.Some top trigrams as given by the ll3 module are listed below. ---------------------------------------------------------------------------------------- part of output after running ll3 ---------------------------------------------------------------------------------------- ,<>.<>The<>1 37376.3913 7 50254 41005 6204 57 30 4419 of<>.<>The<>2 36039.5281 1 33018 41005 6204 19 1 4419 .<>The<>.<>3 35007.9173 1 41005 6204 41006 4419 302 1 to<>.<>The<>4 34680.5109 2 22443 41005 6204 37 4 4419 a<>.<>The<>5 34526.7254 1 18789 41005 6204 15 1 4419 ---------------------------------------------------------------------------------------- Cutoff Point -------------- The cutoff point for user3 output can be observed around 120 to 150. however it is not a very sharp cutoff and some collocations maybe found beyond this limit. No such limit for ll3 was observed. Overall Recommendation ---------------------- Overall, user3 (mutual expectation) outperforms ll3 for finding collocations in trigrams.This maybe due to the fact that it considers the joint probabilities of all the (n-1)- grams in a n-gram. However, pointwise mutual information is better than mutual expectation for bigrams. References: ---------- [1] Manning C., Schutze H., "Foundations of Natural Language Processing". [2] Dias G., S.Guillore and J.G.P. Lopes " Extracting Textual Associations in Part-of-Speech Tagged Corpora". http://nl.ijs.si/eamt00/proc/Dias.pdf [3] Dias G., S.Guillore and J.G.P. Lopes " Mining Textual Associations in Text Corpora". http://www.cs.cmu.edu/~dunja/KDDpapers/Dias_TM.rtf.++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ############################################################################## ASSIGNMENT 2 THE GREAT COLLOCATION HUNT REPORT BY - AMRUTA PURANDARE ############################################################################## ------------ OBJECTIVE ------------ To investigate various measures of association which can be used to used to identify collocations in large corpora of text. Identify and Implement a measure that can be used with 2 or 3 word sequences and compare the results with other standard measures. ############################################################################## ========================================================================= SOME MATHEMATICAL AND STATISTICAL CONCEPTS USED FOR THE VARIOUS EXPERIMENTS CONDUCTED HERE ========================================================================= ---------------------------------------------------------------- Visualising 3 D data and mapping or transforming it to 2D space. ---------------------------------------------------------------- While dealing with the 3 Variable experiemnts, the following thoughts were given while anaysing the problem of identifying and visualising the relationship between 3 Random Variables. The random variables here are the words in the context which form a trigram. Thus for a trigram W1 W2 W3 which is a trigram of 3 words occuring in this particular sequence. The way we build a contingency table for tabulating the relation between 2 variables (words) using the contongency tables, we try to visualise and analyze the same relation for the set of 3 variables(words). ------------------------------------------ VISUALISING 3 VARIABLES WITH VENN DIAGRAMS ------------------------------------------ To translate the information about the relation between 3 words we can draw a Venn Diagram with 3 intersecting circles showing 7 possible regions. These regions are classified as R1 #Occurances of all the 3 words together #Number of times this trigram occurs = n111 value in the contingency table R2 #Occurance of 2 words in the trigram together and at their #Number of trigrams which contain only 2 of the 3 words in the trigram (at its right position/index) right position but absence of the third word in the trigram. These are the sections common to two circles but not to all three. = n112 , n211, n121 values in the contingency table. where 1 at ith position indicates the presence of the word at ith location in the trigram. And 2 indicates that the word is not present. e.g. looking at n112, we say that we have a trigram that looks like W1 W2 _ where W1 and W2 occur at 1st and 2nd but W3 doesn't occur at the third position. All the values n112, n121, n211 show the regions in Venn diagram common to two circles which represent those words. R3 #Occurance of any one word in the trigram. These are the sections of the circles which belong to only on circle and thus represent the occurance of a word without any of the other words in the trigram. = n122, n212, n221 show that only the word corresponding to 1 at ith position occurs. eg n122 shows the string W1 _ _ which are the trigrams starting with W1 but not followed by W2 or W3. R4 #Occurance of none of the words This is the section of the Venn Diagram that belongs to no circle indicating all the trigrams which dont contain either of the three words in the trigram. ------------------------------------------------------------ BUILDING A CONTINGENCY TABLE FROM THE FREQUENCY COMBINATIONS ------------------------------------------------------------ Using the Venn Diagram, we arrive at the deerivation of following formulae- The trigram is the combination of 0 1 2 0 1 2 0 1 1 2 0 2 Where these are actually f(0 1 2) = n111 = All together f(0) = n1pp = Total Occurances of W1 at 1st position = W1** f(1) = np1p Same as above = *W2* f(2) = npp2 = **W3 f(0 1) = n11p = Total occurances of pairs W1W2* which may be followed by W3. f(1 2) = np11 = same as above for *W2W3 f(0 2) = n1p1 = same for W1*W3 These are the basic values which help us to derive the entire contingency table for 3 Variables. ----------------------- DETERMINING OTHER CELLS ----------------------- We represent the contingency table values as n111 = All words together = n111 n112 = W1W2 not followed by W3 = n11p-n111 n121 = W1(!W2)W3 = n1p1-n111 n122 = W1(!W2)(!W3) = n1pp - n112 - n121 + n111 n211 = (!W1)W2W3 = np11-n111 n212 = (!W1)W2(!W3) = np1p - n112 - n211 + n111 n221 = (!W1)(!W2)W3 = npp1 - n121 - n211 + n111 n222 = (!W1)(!W2)(!W3) = n - n122 - n212 - n221 - n112 - n211 - n121 - n111 ------------------------------------- DRAWING A CONTINGENCY TABLE FROM THIS ------------------------------------- To draw a contingency table for the above data we represent W1 and W2 as rows and columns of the table. And in each of these cells, we place 3 possible values of W3 Contingency Table W2 W2' W1 n111 n121 n112 n122 W1' n211 n221 n212 n222 Where all the Top values in each cell represent W3 and all the bottom values represent W3' ############################################################################## ======================================= BI-PREDICTIVE EXACT TEST FREEMAN-HALTON TEST ======================================== =========== REFERENCE =========== [Miller] - EXACT TESTS FOR SAMPLE 3X3 CONTINGENCY TABLES WITH EMBEDDED FOUR FOLD TABLES, Matthias Miller, German Genral of Psychiatryi, strabe 8, D-55131 We implement here a bipredictive exact test to find the the degree of association between two words. ------------------------------------------------- HOW AN EXACT TEST DIFFERS FROM A CHI-SQUARE TEST ------------------------------------------------- Chi-square tests are unsuitable when the number of observed and expected values are very small. The exact tests differ from the chi-square test in that these are based on the cumulative point probabilities of all possible contingency matrices and compare the cumulative probability value with some pre-defined threshold alpha = 0.05 based on which the decision whether to accept or reject a null hypothesis is taken ---------- Procedure- ---------- The procedure to find the cumulative probability involves the calculaton of the point probabilities of all possible contingency matrices and adding only those which are same or as high as the observed point probability. Thus from a given contingency table, we first find its point probability n11/n++ for a 2x2 contingency table which is the probability of finding the two words together. To find all possible contingency tables we use the following algo which could be easily generalised to get the possible cell values Consider a contingency table, for 2 RANDOM VARIABLES X1 X2 X3 ------------------ Y1 3 4 5| y1 Y2 4 5 5| y2 Y3 2 2 4| y3 ---------------------------- This shows the relationship between two random variables (X,Y) each one of which can take 3 possible values. Note that the marginal values are fixed as the cell values are variables but for a given number of observations, N, number of instances of getting each category/type of the random variables is fixed. Thus for a fixed marginal indices, we find all possible contingency matrices by varying the cell values and computing a point probability for each of the combination. ---------------------------------- FINDING ALL POSSIBLE COMBINATIONS ---------------------------------- Lets first consider an example for 2X2 matrix as its easier to begin with and then we extend the algorithm for nXn tables. or even for nXnXn cubes (for 3 variables) 3 4 | Y1=7 2 3 | Y2=5 ---------- X1=5 X2=7 12 From this, we note that the first cell belongs to two groups namely, X1 and Y1. Hence the cell values in (1,1) can be varied from 0-min(X1,Y1) which is [0-5]. This is intuitive as the cell value can't be > 5 if the X1 has to be 5. (from n11+n21=X1, X1>=n11,n21) Hence, we say that the value of the cell (1,1) can be chosen from [0-5] interval and foreach value of n11 we get a new contingency table. As the other values can be calculated w/o any prediction or assumption, we have for 2X2 tables, min(X1,Y1) = #possible tables. e.g. T1 | T2 | T4 | T5 0 7 | 1 6 | 4 3 | 5 2 5 0 | 4 1 | 1 4 | 0 5 shows 5 possible tables apart from the observed table. From this we calculate for each of the tables, a point probability which is n11/n++ and we take cumulative adition of all those which are same or smaller than the observed value. In this case, it would be <= 3/12 which is the oberved value. We see only tables, T4 and T5 follow this criteria and hence for these we calculate the phi (coefficient of partial association) as we can not simply take the cumulative probability to accept/reject the hypothesis in this particualr program. Which is expected to return the test score. The test score is calculated as the phi= SUM (n11*n22-n12*n21)/(n1+ * np1 * n2p * np2 * (n11+n12)/(n++)) [Miller] for 2 variables and phi= SUM (P(x,y,z)-(p(x)*p(y)*p(z)))/(n1pp * np1p * n2pp * np2p * npp1* npp2)**1/3; [Miller] These scores give us an idea of the partial association between the variables for Point Probability of the Table < Observed Point probability The Logic of finding all possible contingency tables can be extended to a general case when either we have rxc contingency table or NxMxP 3D space for 3 Variables (This is discussed in user3.pm). Re - consider our first example for 3x3 contingency table, 3 4 5| Y1=12 4 5 5| Y2=14 2 2 4| Y3=8 ------------------------------ X1=9 X2=11 X3=14 N=34 From this data, we note that the fixed marginal values associated with the cells. Or each cell belongs to 2 groups in 2D structure n11 => X1,Y1 n12 => X2,Y1 n13 => X3,Y1 n21 => X1,Y2 n22 => X2,Y2 n23 => X3,Y2 n31 => X1,Y3 n32 => X2,Y3 n33 => X3,Y3 We start with cell (1,1) forwhich the possible range is [0-9] We vary the value of n11 from 0 to 9 Foreach value of n11, we find range of n12 such that the rule of marginal totals holds true i.e. n12+n13 = Y1-n11 ;vary n11 in [0-9] Also, n12+n22+n32 = X2 Thus, the range of n12 is calculated as min(Y1-n11,X2) In the above example, it is [0-min(11,12-n11)] Once we know, n11 and n12, we can find n13. Now For n21, we consider, n21+n31=X1-n11 and n21+n22+n23=Y2 Thus n21 can take any value in the interval [0-min(X1-n11,Y2)] as this satisfies the condition of fixed marginal values. After finding n21, n31 can be easily calculated from the fact that we know n11 and n21 and X1 and n11+n21+n31=X1 To find, n22, we consider the equations, n21+n23=Y2 and n22+n32=X2-n12 from which the range of n22 is found. For n22, we can find n23 and n32 and then also n33 as all other variables in the equation are given values. After we assign this set of values to the current table cells we say that we obtained one of all possible contingency tables. And then we perform the test if the point probability of this combination is < the observed point probability if yes, then we find the score. Note that the process of assigning all possible values to all the cells in the table could be easily generalised though we haven't made an attempt here for simplicity. However, there are algorithms which would help us in assigning the values to all matrix cells based on the information of to which groups they belong or the marginal values associated with the cell and which decide the range of other cells in the same group. ----------------------------------------- PERFORMING SAME PROCEDURE FOR 3 VARIABLES ----------------------------------------- --------------------------------- CONTINGENCY TABLE FOR 3 VARIABLES --------------------------------- W2 W2' W1 n111 n121 n112 n122 W1' n211 n221 n212 n222 ---------------------------------------- FINDING ALL POSSIBLE CONTINGENCY TABLES ---------------------------------------- In this we see that the marginal values are X1, X2, Y1, Y2, Z1,Z2 where X1=W1pp, X2=W2pp Y1=Wp1p Y2=Wp2p Z1=Wpp1 Z2=Wpp2 these are fixed. So we vary the cells so that the rule for constant marginal values is obeyed. Lets start with cell n111 which can take all possible values from 0 to the maximum decided by the marginal groups to which this cell belongs. n111 belongs to groups W1, W2 and W3 hence, the upper limit on the possible values n111 can take is decided by min(X1,Y1,Z1). Vary n111 in the range [0..min(X1,Y1,Z1)] Foreach value of n111 find possible values of n121 where n121 belongs to (X1,Y2,Z1) and hence n121 can take values in the range [0..min(X1-n11,Y2,Z1-n111)] as the bounds on X1 ,Y1 and Z1 change as n111 is assigned a value. let X1'=X1-n111 Y1'=Y1-n111 and Z1'=Z1-n111 so after this we get X1'-=n121 Y2'-=n121 Z1'-=n121 Vary n121 in this range and foreach value of n121 find possible values of n112 which can now take the values in the range [0..min(X1',Y1',Z2)] so vary n121 in this range. After assiging n121 a value we immediately get n122 which is as W1pp=n111+n121+n122+n112 update X1', Y1' and Z2 for n112 and X1',Y2' and Z2' for n122 Now we find the possible range for n211, which is [0..min(X2',Y1',Z1')] and update X2',Y1' and Z1' by subtracting n211 from them. For each of these we find the range for n221 which is [0..min(X2',Y2',Z1')] and update X2', Y2' and Z1' After we get this, n212 and n222 can be asily found from the facts Wp1p=n111+n112+n211+n212 and Wpp2=n112+n122+n212+n222 Foreach of the contingency matrix thus formed find the point probability associated and cumulate those which are same or less than the observed point probability. Find coefficient of partial correlation phi which is P(x,y,z)-P(x)P(y)P(z)/(P1++ * P2++ * P+1+ * P+2+ * P++1 * P++2)^1/3 The copy of the paper which describes this test is provided in the same directory for further reference. Example W2 W2' | 2 1 | W1 | W1++= 7 2 2 | ------------------------------ | 3 1 | W1' | W2++=9 3 2 | ------------------------------- W+1+=10 W+2+=6 N= 16 W++1= 7 W++2=9 For n111, range is [0..min(7,10,7)] = [0..7] For n121, range is [0..min(7-n111,6,7-n111)] For n112, range is [0..min(7-n111-n121,6-n121,9)] n122= 7-n111-n121-n112 For n211, range is [0..min(9,10-n111-n112,9-n112-n122)] For n221, range is [0..min(9-n211,6-n121-n122,7-n111-n121-n211)] n212 = W+1+ -n111 -n112 -n211 n222 = W+2+ -n121 -n122 -n221 ############################################################################## EXPERIMENTS , OBSERVATIONS AND CONCLUSIONS ===================== FOR BIGRAMS ===================== =================== TOP 50 COMPARISIONS =================== With tmi.pm the Top 50 ranks obtained are This is an output when test1.cnt was given to tmi.pm This is directly taken from test1.tmi 228385 ,<>and<>1 5137.9468 3410 15448 6652 of<>the<>2 2831.5441 2756 7919 15578 .<>The<>3 2733.2180 766 6113 785 had<>been<>4 1783.7571 468 2000 748 ,<>but<>5 1515.4185 765 15448 1025 the<>enemy<>6 1300.0674 532 15578 570 on<>the<>7 1297.3587 928 1850 15578 the<>the<>8 1133.2538 1 15578 15578 ,<>,<>9 1106.8880 2 15448 15448 the<>,<>10 1068.7076 11 15578 15448 .<>In<>11 849.1795 241 6113 247 o<>clock<>12 789.9850 90 90 91 General<>Grant<>13 788.2227 158 877 202 in<>the<>14 710.0651 955 3504 15578 .<>This<>15 694.0795 201 6113 210 enemy<>s<>16 682.5241 220 570 1956 did<>not<>17 636.3140 130 189 760 s<>division<>18 615.1822 177 1956 342 .<>When<>19 541.5669 152 6113 154 a<>few<>20 506.7795 140 3081 183 Court<>House<>21 506.2121 69 83 123 it<>was<>22 505.3449 252 1161 3010 of<>,<>23 502.0967 13 7919 15448 would<>be<>24 485.1509 147 572 1001 .<>He<>25 477.0930 133 6113 134 to<>be<>26 450.7241 290 6619 1001 .<>At<>27 447.2214 126 6113 128 .<>On<>28 433.4946 121 6113 122 ,<>however<>29 433.1055 166 15448 169 s<>brigade<>30 432.3818 121 1956 216 ,<>.<>31 427.0564 1 15448 6113 .<>As<>32 426.2314 119 6113 120 .<>It<>32 426.2314 119 6113 120 .<>the<>33 425.0891 2 6113 15578 Sixth<>Corps<>34 422.1696 65 89 189 ,<>who<>35 413.9050 252 15448 415 to<>,<>36 406.7600 14 6619 15448 however<>,<>37 391.7628 159 169 15448 OF<>THE<>38 389.9416 68 159 185 United<>States<>39 388.0526 47 49 75 I<>had<>40 382.8230 245 2600 2000 ;<>but<>41 379.0308 132 734 1025 could<>be<>42 371.8051 118 517 1001 ,<>which<>43 368.7570 350 15448 934 I<>was<>44 360.6061 278 2600 3010 to<>the<>45 357.7431 1077 6619 15578 could<>not<>46 351.4256 106 517 760 .<>I<>47 340.4961 368 6113 2600 ;<>and<>48 339.9902 217 734 6652 =============================== ANALYSIS OF THE ABOVE RESULTS =============================== ******************* ANALYSIS 1 ******************* ------------- OBSERVATIONS ------------- (1) Most of the collocations consist of puncuations which dont really play any special role in puncuations (2) Bi-grams with punctuations on the TOP??? (3) Lot of trivial collocations have got over emphasis and higher rankings like did not, had been, on the, of the which do occur frequently but are not collocations as they dont follow the non-compositionality rule. (4) Some good collocations like United States, Court House are lower ranked (5) The ratio of good collocations to the bi-gram size under consideration is very low which is about 0.06. ========================== CONCLUSIONS FOR test1.tmi ========================== True Mutual Information has some inherent drawback by which it can't differentiate between good collocations and some trivial collocations which do occur most frequently but dont provide any information. This problem could be handled using a stop-list for indivisual words as well as for bi-grams which will list not only the tokens which can be ignored but also will list some trivial bi-grams which need not be analysed. On the other hand, the output obtained from user2.pm shows some good ranking scheme even when the stoplist is not provided which is really surprising. ********************** ANALYSIS 2 ********************** ON test1.user2 228385 o<>clock<>1 1571.3650 90 90 91 ,<>and<>2 1425.9268 3410 15448 6652 Five<>Forks<>3 877.8387 34 35 37 had<>been<>4 796.4478 468 2000 748 Cold<>Harbor<>5 793.1611 24 24 24 Rio<>Grande<>6 778.9809 28 28 32 Court<>House<>7 772.4336 69 83 123 United<>States<>8 764.9516 47 49 75 .<>The<>9 738.5375 766 6113 785 Project<>Gutenberg<>10 724.5090 26 31 26 Front<>Royal<>11 711.9523 21 21 22 of<>the<>12 683.0092 2756 7919 15578 UNITED<>STATES<>13 480.8483 10 10 10 New<>Orleans<>14 471.7674 37 82 44 Sixth<>Corps<>15 458.8941 65 89 189 rifle<>pits<>16 437.9938 16 22 19 General<>Grant<>17 425.4898 158 877 202 PROJECT<>GUTENBERG<>18 418.1405 8 8 8 Crown<>Prince<>19 416.3935 19 20 35 Missionary<>Ridge<>20 397.1066 23 23 53 Corpus<>Christi<>21 383.1539 7 7 7 Father<>Pandoza<>21 383.1539 7 7 7 Cedar<>Creek<>22 376.0958 41 42 141 Count<>Bismarck<>23 356.5804 28 47 54 Literary<>Archive<>24 344.8794 6 6 6 OF<>THE<>25 341.1757 68 159 185 did<>not<>26 340.5760 130 189 760 Camp<>Supply<>27 334.1003 12 18 15 Black<>Kettle<>28 316.0808 7 9 7 MAJOR<>GENERAL<>29 310.4870 29 30 100 Small<>Print<>30 306.7674 6 7 6 Jules<>Favre<>30 306.7674 6 6 7 la<>Tour<>30 306.7674 6 7 6 Mars<>la<>30 306.7674 6 6 7 Piedras<>Negras<>31 302.2383 5 5 5 Yaquina<>Bay<>32 302.2195 9 9 15 White<>Oak<>33 296.8036 21 64 21 Bowling<>Green<>34 276.5141 6 6 8 CITY<>POINT<>35 273.6775 6 7 7 Beaver<>Dam<>36 269.5503 7 11 7 Yellow<>Tavern<>37 263.2275 12 16 23 le<>Duc<>38 263.0624 5 6 5 Medicine<>Lodge<>38 263.0624 5 6 5 Blue<>Ridge<>39 258.7102 17 17 53 San<>Antonio<>40 254.6795 8 15 8 widow<>Glenn<>41 253.4355 4 4 4 Des<>Chutes<>41 253.4355 4 4 4 Archive<>Foundation<>42 251.8580 6 6 9 ,<>but<>43 232.3167 765 15448 1025 =========================== OBSERVATIONS ON test1.user2 =========================== (1) Results seem to be better than results obtained by tmi in the sense that the overall % of trivial collocations like had been, I was, I had is low. (2) Overall % of collocations involving punctuations is also low. (3) Good collocations like United States, Court House are given satisfactorily higher ranks. (4) Most of the letters in this output are capital letters. ============================ CONCLUSIONS FROM test1.user2 ============================ How is it possible that when the same data is given input to two seperate tests, one contains most of the Capital Letters while other contains almost all in lower case. Does this mean that the user2 shows some bias towards Capital letters or specifically speaking, bias towards those words which start with Capital letters. As we know that none of the program really considers the Case - senstitive handling, the results are surprising. As user2 almost all involves words starting with Upper case letters while tmi.pm selects all lowercase letters. ======================================= CONCLUSIONS ON test1.usr2 and test1.tmi ======================================= The Top 50 list shows very few bi-grams in common. When these two (very much unrelated) files are given to rank.pl which calculated the correlation, we confirmed the fact by getting the value very close to 0 which means unrelated. The scores generated in this Top 50 list were observed and it was found that tmi.pm 's output shows the test scores [340-5137] while the range of the test_scores for user2 is [232-1571] which is quite narrow compared to tmi.pm. How to Categorize the Top 50 produced by each of the modules- (1) Divide the set of bi-grams into subclasses according to their scores. e.g. If the scores are In the file we saw, the typical phrases like move back, move forward, come back, move rapidly were found close to each other in their rankings and scores. Also, phrases like several kinds, various points, many kinds were found together. Following is an another example that supports this fact- prime minister, grand jury, head quarters were found close. on account, rather than, as soon (part of as soon as), by the (part of by the ways) were found close to each other. another examples- included by, represented by, responded by, traversed by, influenced by were all found together. The reasoning could be that the algorithm takes into account the possible relation between the frequencies of the typical phrases which could be more or less same. (2) Also, we observed that the ranks of the Proper Nouns are higher than any other bi-grams which are listed on the top of the list of bi-garms. (3) Bi-grams containing the punctuation marks are at the bottom of the list with low scores which is very helpful and sound. The Cut-off-point frequency for this could be around 7700 after which some trivial collocations and those involving punctuation marks come. ############################################################################# ============================= RANK COMPARISIONS ============================= COMPARISION TABLE ll.pm mi.pm dice.pm x2.pm tmi.pm 1.00 0.777 0.6684 0.9275 user2.pm 0.7123 0.4083 0.1927 0.5514 This table shows various comparisions made using rank.pl for the results obtained using several tests. We compare here the coefficient of co-rrelation of out tests with the standard tests already in NSP like ll.pm, dice.pm, x2.pm or mi.pm. The comparision table shows that tmi is more simialr to log-likelihood test as we get 1 there. Next it is more similar to x2 test where the coefficient is 0.9275. The low coeffcients of comparisions for user2 with other tests shows that this is a different kind of test as we said that this test accepts/rejects a null hypothesis based on the cumulative probability obtained by considering all possible contingency matrices which have point probabilities below the observed probability. Due to its un-usual way of analysing the same problem we get low correlation with the standard tests. However, the results in fact show very good collocations standing in the top of the list while some trivial ones at the bottom. ********************************************************************************* ===================== TRIGRAM EXPERIMENTS ===================== ------------- TOP 50 RANK ------------- Using ll3.pm Log-likelihood topll3 file 228384 ,<>and<>the<>1 10126.2188 280 15448 6652 15578 3410 1238 470 ,<>and<>,<>2 9700.4814 68 15448 6652 15448 3410 976 90 of<>,<>and<>3 9390.5841 3 7919 15448 6652 13 212 3410 to<>,<>and<>4 9236.3700 6 6619 15448 6652 14 78 3410 ,<>and<>I<>5 9158.8664 148 15448 6652 2600 3410 448 190 ,<>and<>of<>6 9111.1531 15 15448 6652 7919 3410 192 23 .<>,<>and<>7 9003.4943 7 6113 15448 6652 52 42 3410 ,<>and<>as<>8 8976.2952 124 15448 6652 1431 3410 241 144 ,<>and<>to<>9 8846.3357 60 15448 6652 6619 3410 299 82 ,<>and<>that<>10 8754.8857 99 15448 6652 2406 3410 242 109 ,<>and<>then<>11 8707.3660 71 15448 6652 312 3410 78 96 however<>,<>and<>12 8667.4399 28 169 15448 6652 159 30 3410 ,<>and<>in<>13 8657.3422 83 15448 6652 3504 3410 258 104 in<>,<>and<>14 8594.8857 5 3504 15448 6652 17 47 3410 I<>,<>and<>15 8508.5442 1 2600 15448 6652 17 2 3410 ,<>and<>on<>16 8471.9385 62 15448 6652 1850 3410 153 74 ,<>and<>when<>17 8458.3958 53 15448 6652 401 3410 93 61 ,<>and<>he<>18 8456.1310 47 15448 6652 1126 3410 212 50 was<>,<>and<>19 8447.5863 1 3010 15448 6652 24 39 3410 ,<>and<>was<>20 8446.6748 38 15448 6652 3010 3410 313 43 ,<>and<>we<>21 8412.2439 49 15448 6652 796 3410 150 56 ,<>and<>a<>22 8387.9178 54 15448 6652 3081 3410 259 115 ,<>and<>with<>23 8346.7838 43 15448 6652 1592 3410 122 50 ,<>and<>at<>24 8344.0516 44 15448 6652 1498 3410 147 51 ,<>and<>by<>25 8338.1816 40 15448 6652 1677 3410 187 47 ,<>and<>had<>26 8315.8525 29 15448 6652 2000 3410 219 34 ,<>and<>it<>27 8312.4198 32 15448 6652 1161 3410 189 41 ,<>and<>also<>28 8296.5899 33 15448 6652 214 3410 53 37 ,<>and<>from<>29 8281.7874 32 15448 6652 1319 3410 71 41 by<>,<>and<>30 8278.4059 2 1677 15448 6652 9 24 3410 with<>,<>and<>31 8276.6025 4 1592 15448 6652 7 51 3410 s<>,<>and<>32 8275.1251 5 1956 15448 6652 26 94 3410 ,<>and<>there<>33 8267.2684 33 15448 6652 428 3410 75 52 ,<>and<>though<>34 8264.2430 28 15448 6652 148 3410 33 31 ,<>and<>this<>35 8260.8464 32 15448 6652 983 3410 80 35 ,<>and<>after<>36 8249.5074 30 15448 6652 312 3410 55 33 ,<>and<>for<>37 8243.3453 16 15448 6652 1531 3410 49 22 me<>,<>and<>38 8243.1961 30 768 15448 6652 104 34 3410 cavalry<>,<>and<>39 8242.2104 22 368 15448 6652 95 24 3410 on<>,<>and<>40 8238.1017 2 1850 15448 6652 33 16 3410 from<>,<>and<>41 8225.8020 1 1319 15448 6652 6 17 3410 him<>,<>and<>42 8224.4330 27 557 15448 6652 83 29 3410 Corps<>,<>and<>43 8206.2651 19 189 15448 6652 60 20 3410 men<>,<>and<>44 8206.2261 21 427 15448 6652 90 25 3410 Creek<>,<>and<>45 8204.7301 16 141 15448 6652 58 17 3410 ,<>and<>my<>46 8202.8438 27 15448 6652 1265 3410 97 44 line<>,<>and<>47 8191.8624 23 349 15448 6652 61 26 3410 road<>,<>and<>48 8189.9568 14 322 15448 6652 87 17 3410 ,<>and<>his<>49 8189.0707 26 15448 6652 1315 3410 115 51 ====================== OBSERVATIONS ====================== (1) There are no good collocations as most of the trigrams consist of ounctuations and trivial phrases and conjunctions these could be improved using the stopfile ################################################################################# ------------------- CUTOFF POINTS ------------------- For test.ll3 No cut-off points was observed as the collocations are trivial and we can't obviously see good collocations separated from the non-collocations. The results is not naturally categorized which was the case with user2.pm. Its hard from this trigram table to determine the cut-off point as even trgrams on the top of the list are not so good. ################################################################################# ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ REPORT ----------------------------------------------------------------------------- Prashant Rathi Date -11th Oct. 2002 CS8761 - Natural Language Processing ----------------------------------------------------------------------------- Assignment no.2 Objective :: To investigate various measures of association that can be used to identify collocations in large corpora of text. To identify and implement a measure that can be used with 2 and 3 word sequences, and compare this method with some standard measures. The experiments were performed on a large Corpus.It is available at /home/cs/rath0085/CS8761/nsp/brown.txt . It contains about 1510192 bigrams and 151890 trigrams. ----------------------------------------------------------------------------- EXPERIMENT 1: File No.1 - tmi.pm implements true mutual information for 2 word sequences. It is calculated as, TMI = summation( P(x,y) log P(x,y)/P(x) * P(y)) It is implemented for bigrams and the count file generated by count.pl is passed to it.The program can be run as, statistic.pl tmi outputfile inputfile File No.2 - user2.pm implements the Yule's Coefficient of Association for bigrams. The reference is available in the book Multiway Contigency Table Analysis for Social Sciences by Thomas D. Wickens.It is formulated using the cross-product ratio alpha. It is the ratio of the odds in the first row (or column) to that in second row (or column). alpha = N11 * N22 / N21 * N12 The Yule's coefficient of Association is given as, Q = (alpha -1)/(alpha + 1) The Program shows same results as obtained by hand computations. Top 50 Comparisons: Comparing the first 50 ranks produced in both the outputs for tmi and user2 yields interesting results.First few ranks produced by tmi are, of<>the<>1 0.0098 9123 36043 62690 .<>The<>2 0.0071 3507 49673 6921 in<>the<>3 0.0059 5265 19580 62690 .<>He<>4 0.0049 2029 49673 2991 ,<>and<>5 0.0046 5376 58974 27952 .<>It<>6 0.0033 1405 49673 2185 ,<>but<>7 0.0031 1615 58974 3006 on<>the<>8 0.0029 2190 6405 62690 to<>be<>8 0.0029 1606 25724 6340 United<>States<>9 0.0027 351 456 447 the<>the<>10 0.0026 3 62690 62690 ........... It is seen here that the collocations are commonly occurring ones like "of the","on the" and they are large in number. Such collocations are observed in the top 50 ranks of tmi.pm.On the other hand first few ranks produced by user2.pm are, cowardly<>compromising<>1 1.0000 1 3 4 tropical<>sprue<>1 1.0000 1 11 2 Elisabeth<>Carroll<>1 1.0000 1 3 17 Nero<>Wolfe<>1 1.0000 2 3 9 aviation<>gasoline<>1 1.0000 1 2 11 polite<>minuet<>1 1.0000 1 7 2 favored<>splitting<>1 1.0000 1 18 3 powered<>radios<>1 1.0000 1 6 7 Brooks<>Atkinson<>1 1.0000 1 17 2 heroin<>addicts<>1 1.0000 1 2 4..... Here the collocations are more interesting as they are uncommon ones like "cowardly compromising".They occur very infrequently.Also it can be seen here that many collocation give same value for this test and rank 1 has many entries. I think user2.pm is better than tmi to get interesting collocations as tmi.pm produces ranks highly dependant on the no. of times they occur. Cutoff Point tmi. pm produces values for bi-grams in the range from 0 to 1.It produces very few ranks ( many collocations have the same rank) as compared to user2.pm.The cutoff point for tmi.pm is as shown, with<>and<>32 0.0001 7 7007 27952 is<>one<>32 0.0001 100 9969 3073 5<>culminating<>33 0.0000 1 10611 2 the<>Sierras<>33 0.0000 2 62690 2 ...... The values for all next bigrams is 0 which is true for more that 90% of bigrams.Such bigrams are very uncommon and so their probability of occurrence nears 0. user2.pm produces similar values for many bigrams and the range observed is from -1 to 1.The cutoff point observed is as shown and below this all trigrams give negative values which indicates that the cross-product ratio is less than 1. amount<>for<>9872 0.0001 1 171 8833 or<>look<>9872 0.0001 1 4127 366 the<>J56<>9873 0.0000 10 62690 241 to<>G62<>9874 -0.0001 4 25724 235 to<>G53<>9874 -0.0001 4 25724 235 ...... This also means that the two words in such bigrams occur individually a lot more than they occur together. Thus, tmi produces shows a steep curve while for user2 the values decrease gradually.This may be because the frequency count dominates tmi more than it does in user2. Rank Comparison tmi produces rank correlation coefficient of -0.9493 with mi.pm and user2.pm produces rank correlation coefficient of 0.7311 with the same. This means that tmi tends to produce results opposite to mi while user2.pm produces results somewhat similar to mi.Here also the corpus size matters as if the corpus size is small then tmi produces results more similar to mi then is the case for large corpus. With ll3, it was observed that tmi produces -0.9491 as the rank correlation coefficient and user2 produces 0.5863 with ll. Again we can say that user2 is more similar to ll than tmi.These results taken for large corpus have shown above values, may be the results will be different for other corpus. Overall Recommendations From the observed results, I think ll and user2 are good tests for identifying collocations. Tmi shows questionable results when compared with mi for large corpus. This is because tmi produces very few ranks i.e. the values tend to be similar and hence due to the limited range of values the rank correlation coefficient differs.It is difficult to identify interesting collocations on the basis of tmi values.Also as we observed above, user2 helps us to find interesting and unfrequent collocations. ----------------------------------------------------------------------------- EXPERIMENT 2: File No.3 - ll3.pm implements the log-likelihood ratio for trigrams. The formula for the same is, GSquare = 2 * summation(Obs.value *log( Obs. value/Expected Value)) The observed values can be seen by building a 2 x 2 x 2 table. The Expected value is calculated using the null hypothesis P(x,y,z)=P(x)P(y)P(z). Thus the expected value in terms of frequency count is, exp(x,y,z)=n(x)*n(y)*n(z)/ (totaltrigrams)*(totaltrigrams) File No.4 - user3.pm implements the Yule's Coefficient of Association for trigrams. The extension of the test for three dimensions is mentioned in Ch.9 of the book Multiway Contigency Table Analysis for Social Sciences by Thomas D. Wickens. It is formulated by extending the cross-product ratio to three dimensions.It is given as, alpha = (N111 * N122 * N212 * N221)/(N112 * N211 * N121 * N222) The Yule's coefficient of Association is given as, Q = (alpha -1)/(alpha+1) The Program shows same results as obtained by hand computations. Top 50 Comparisons: The following results are observed for top 50 ranks in ll3.pm. of<>the<>fact<>41 30378.7574 19 36043 62690 446 9123 21 159 of<>the<>way<>42 30365.6054 14 36043 62690 909 9123 21 215 of<>the<>state<>43 30365.0389 33 36043 62690 603 9123 44 180 of<>the<>last<>44 30358.6645 16 36043 62690 642 9123 20 183 of<>the<>best<>45 30357.9556 19 36043 62690 349 9123 24 141 series<>of<>the<>46 30357.4622 1 122 36043 62690 76 3 9123 because<>of<>the<>47 30355.6422 56 807 36043 62690 156 59 9123 front<>of<>the<>48 30342.5881 40 226 36043 62690 97 45 9123 of<>the<>entire<>49 30338.2260 14 36043 62690 148 9123 15 97 One<>of<>the<>50 30328.0345 77 417 36043 62690 100 83 9123....... The values are very high as can be seen in the table.The values do not directly depend on the frequency of occurrence of trigrams. For user3.pm the trigrams in the first 50 are quite different from that of ll3.pm.Here the values do not change that frequently and trigrams that are seen in the first few ranks are very infrequent. a<>6<>6<>1 1.0000 1 21994 10435 10435 10 2 4 a<>5<>4<>1 1.0000 1 21994 10611 10399 6 2 2 and<>3<>8<>1 1.0000 2 27952 10406 10326 21 3 5 the<>8<>4<>1 1.0000 1 62690 10326 10399 6 4 4 4<>7<>8<>1 1.0000 1 10399 10291 10326 2 4 3 an<>11<>3<>1 1.0000 1 3543 6866 10406 3 2 3 a<>,<>9<>1 1.0000 1 21994 58974 10010 5 2 4 ............ Here ll3 seems to be the better test to identify the interesting collocations because it produces different values for these trigrams which helps in distinguishing them and also the frequency of trigrams doesn't affect these values. Cutoff Point Both ll3 and user3 produce quite different values for the trigrams and the ranks produced are also different.There is no particular point which can be called as the cutoff-point below or above of which we can find interesting collocations. For ll3 the range of values is large and the curve will be steady whereas for user3 the values are in the range of -1 to 1 and the curve will have flat portions as there are many trigrams with similar values. Rank Comparison The rank correlation coefficient between the ll3.pm and user3.pm as shown by rank.pl is 0.0153.This value is pretty much close to 0 and this shows that these tests do not have much relation among them.This can be easily seen from the results. Overall Recommendations The loglikelihood test implemented for trigrams is much better than the user3 (yule's coefficient).This is beacause ll3 identifies interesting collocations very well due to the large range of differing values it produces. It is also observed that the frequency of trigram occurrence does not play a major role in computing values using these tests. The Yule's Coefficient calculated also gives a measure of association different from the ones already existing in NSP. Conclusion : The test results differ depending on the corpus size.tmi and mi show very contrasting results for large corpus.The log likelihood ratio shows good results when extended for trigrams.The measure of association implemented by me shows similarity with log likelihood more for bigrams then for trigrams. ----------------------------------------------------------------------------- ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ CS8761: Assignment 2 - Statistical Models for NSP (due 10/11/2002 12pm) Sam Storie ID 1824410 10/8/2002 +------------------------------------------------------------------------+ | Methods | +------------------------------------------------------------------------+ Introduction: The main goal of this assignment is to implement a couple different statistical measures as modules within the N-Gram Statistics Package (NSP). NSP is useful for counting n-gram sequences within bodies of text and then performing some statistical analysis on the data generated. In addition to providing 4 statistical measures, NSP provides a modular interface for adding new statistical tests, and also provides a program to perform Spearman's Rank Correlation Coefficient (which is useful to show how similar two rankings of values are). Given all these tools, NSP makes it relatively easy to observe how related different statistical measures actually are. All the measures included with the NSP package are designed to work with bigram counts, but some statistical measures are able to work with trigrams or even any n-gram count. A good example is the log-likelihood ratio, which comes as a bigram test with NSP, but is easily expanded to analyze a count of trigrams. This was basis for one of our tasks in this assignment. NSP also comes with a module to determine the "pointwise mutual information" value. This value is interesting, but it only considers a specific case in its computation. A possibly more meaningful value is the "true mutual information" which enumerates over all the values in a distribution in an attempt to consider all the counts. This statistic was the basis for another portion of the assignment. The tasks for this assignment included: (a) Implementing a module to determine the "true" mutual information (b) Finding a new statistical measure suitable for 2 and 3 word sequences and implementing it for both cases (c) Implementing a 3-word version of the log-likelihood ratio For my new measure I decided to use the Kolmogorov-Smirnov test (2) for goodness of fit between two distributions. It performs this test by enumerating over all the samples in the distributions, and summing the probabilities as it does so. The result is a cumulative sum of the different probabilities after each iteration. Then this test compares the sum of one distribution to the sum of the other and tracks the largest difference seen at each iteration. This value, called the 'D' value, is then compared against a standard table to determine if it has any statistical significance. Since this test compares the expected counts to the observed counts it scales nicely from the 2-word case to the 3-word case. The 3-word case differs from the two word case in only the number of cells of counts (2-word has 4 cells in the contingency table and the 3-word case has 8). Examples of the actual computations performed with this test are included in the user2.pm and user3.pm source files. Experiment one is designed to compare my true mutual information module with my Kolmogorov-Smirnov (KS) implementation (user2.pm). Then I will compare my modules to the log-likelihood and pointwise mutual information modules included with NSP. Finally I will try to give a recommendation on what I feel is the best measure for finding interesting 2-word collocations. Experiment two is similar to the first except it's concerned with 3-word sequences. I will compare my 3-word version of my KS module (user3.pm) with my implementation of the 3-word log-likelihood module. As with the first experiment, I will comment on which test I feel produces the most useful output. The corpus used: I combined several files from the BROWN1 corpus supplied by Dr. Pedersen to create a corpus of approximately 1,015,000 words. I feel this is a sufficiently large sample to base my experiments on. Also, the BROWN corpora strive to be representative of a wide variety of texts, so this is even more incentive towards its use. I also removed the meta-imformation from the text files to my corpus consisted of only the original text in the the BROWN corpus. +------------------------------------------------------------------------+ | Experiment One | +------------------------------------------------------------------------+ TOP 50 COMPARISON When the top 50 bigram results produced by the true mutual information (tmi) module are compared to the results of my KS module some interesting results emerge. On the surface they appear to produce similar results, but upon a closer inspection something else appears. It is true that some bigrams are ranked high with each case, but the KS module seems to be affected more by the marginal values than the tmi module is. The tmi module ranked some semantically interesting bigrams rather high. "United States" for example, is ranked 11th, but fails to appear in the top 50 KS rankings. Another bigram that is ranked high by tmi is "New York", but again it fails to appear in the top 50 for KS. On the other hand, the KS rankings seem to favor those bigrams containing elements that occur quite often outside the bigram. For example, many of the top 50 bigrams in the KS rankings include a punctuation mark. After some thought I think because the tmi uses the probability of the different cells to provide some "weight" to how much influence that cell has on the total result. Those bigrams that occur more than expected otherwise will have a larger influence over those that are more evenly distributed across the contingency table. With KS a different case can be made. KS is comparing the sums of the probabilities, so those bigrams that have elements with larger marginal values than we expect could influence the result more than those bigrams that simply occur more than we might otherwise expect. The result is the KS test might miss those bigrams with semantic importance, but it still finds those bigrams with counts that deviate from what we would expect. Also, the KS test does not discriminate against observed counts that are higher *or* lower than expected. It only examines the deviation when it assigns a value to the result. Based on initial observations I would say that the tmi module finds more interesting bigrams than the KS test. I feel this is because of the "weight" the tmi test places on the contribution of each cell in the contingency table. It should be stated that both tests fail miserably if a person was concerned with bigrams that form collocations satisfying the compositional meaning requirement. These tests are designed to find those bigrams that deviate from what we would expect in a quantitative sense, and not a qualitative sense. Tables 1 and 2 (at the end of this section) show the top 50 bigram output produced by the tmi and KS modules. CUTOFF POINT To determine an appropriate cutoff is difficult with these two measures because, as I stated earlier, they don't rank based on any sort of semantic meaning. They are simply ranked based on the frequency counts for the various cells in the contingency table. Also, as seen from the output, the values calculated are all fairly close to zero, so the interpretation of these results is difficult. None of the 'D' values computed satisfy the critical value associated with a distribution of 4 samples. There is some vague cutoff that one could argue based on the transition from punctuation containing bigrams to those with semantic words. In both cases the rankings eventually trend towards bigrams containing two words, but it doesn't really start until the 1000th word in the ranking or so. This is a gross approximation, but it still helps illustrate my difficulty in determining an appropriate cutoff value for either of these distributions. RANK COMPARISON These values were computed using the rank.pl program included in NSP. In each comparison the ll.pm or mi.pm results were always listed first in the command line for rank.pl. The results are summarized in the table below: tmi.pm user2.pm (KS module) +-----------+-----------+ ll.pm | -.9616 | -.9805 | +-----------+-----------+ mi.pm | -.9613 | -.9801 | +-----------+-----------+ In each case the results indicate a near reverse ranking for each comparison. This may seem very strange but I believe there is a reason. I feel that because neither the ll.pm module or the mi.pm module directly consider the number of bigrams in the entire sample space, they would produce results opposite of the other case. The major factor in both the ll.pm and mi.pm modules is the comparison between the observed and expected counts. The ll.pm module goes one step farther and factors in the counts for each cell as it enumerates over all of them. The important difference is that the tmi.pm and user2.pm modules use a technique that scales these values according to the total number of bigrams in the sample space, and I believe that will have an inverse-like effect on the rankings for many of the bigrams. The excpetions are those that occur so often as to remain relatively unaffected by these skewing factors, but this is shown by the resultant comparison *not* being -1.0, but rather some value close to it. A rank.pl computation on the tmi.pm results with the user2.pm results shows a value of .9999, which clearly indicates a relationship between the results of the two tests. I feel this relationship has something to do with how they consider the total number of bigrams in their computations. OVERALL RECOMMENDATION In order to make a recommendation, one must be clear on what sort of bigrams are important. I feel that both the tmi.pm and user2.pm do a good job of finding bigrams that contain words with some sort of relationship. The importance of this relationship is up to the user however. If I was to look for bigrams that form some collocation with compositional meaning, then these tests would fail me miserably. However, if I was just searching for bigrams with words that occur together more often than I might expect if they were independent (represented by the "expected" distribution), then I feel these provide a good measure for the top 100 or so rankings. My prior knowledge contains the fact that the log-likelihood test is considered to be a rather meaningful test in this context. Regardless of the Spearman result, the top 100 rankings of the tmi.pm, user2.pm, and ll.pm modules contain many similar bigrams. I think this helps show the validity of both the tmi.pm and user2.pm modules, and their usefulness when searching for bigrams that contain words occuring together more often than we might expect. On the contrary, the mi.pm module is very good at finding bigrams with words for some "greater" meaning. The mi.pm module does not consider the sheer number of bigrams in its computation so if some bigram occurs more often than is expected from the marginal counts, it will be ranked higher. This test does not care that in the scheme of the entire body of text, a few occurances of that bigram might not be quantitatively important. Table 1: The NSP output for the top 50 bigrams ranked by tmi ------------------------------------------------------------- 1152569 .<>The<>1 0.0142 5965 49673 6920 of<>the<>2 0.0079 9623 36043 62690 .<>He<>3 0.0070 2775 49673 2991 .<>It<>4 0.0049 1999 49673 2184 in<>the<>5 0.0047 5557 19581 62690 ,<>and<>6 0.0045 6332 58974 27952 .<>In<>7 0.0040 1598 49673 1736 .<>But<>8 0.0033 1298 49673 1371 ,<>but<>9 0.0032 1882 58974 3006 the<>the<>10 0.0031 3 62690 62690 United<>States<>11 0.0028 393 456 447 .<>This<>12 0.0027 1087 49673 1175 to<>be<>13 0.0025 1697 25725 6340 ,<>,<>13 0.0025 61 58974 58974 on<>the<>14 0.0024 2308 6405 62690 .<>the<>14 0.0024 6 49673 62690 the<>.<>14 0.0024 15 62690 49674 .<>,<>15 0.0023 1 49673 58974 had<>been<>15 0.0023 760 5096 2470 didn<>t<>16 0.0022 392 394 2151 .<>They<>17 0.0021 847 49673 901 ,<>.<>17 0.0021 57 58974 49674 don<>t<>17 0.0021 387 388 2151 .<>There<>18 0.0020 811 49673 901 .<>I<>18 0.0020 1832 49673 5877 New<>York<>18 0.0020 296 554 302 have<>been<>18 0.0020 651 3887 2470 .<>She<>18 0.0020 817 49673 895 has<>been<>19 0.0019 566 2424 2470 .<>A<>19 0.0019 1007 49673 1536 .<>.<>19 0.0019 2 49673 49674 .<>And<>19 0.0019 774 49673 853 would<>be<>20 0.0016 613 2675 6340 will<>be<>20 0.0016 592 2201 6340 of<>,<>21 0.0015 26 36043 58974 U<>S<>21 0.0015 236 324 407 .<>If<>21 0.0015 629 49673 721 did<>not<>21 0.0015 439 991 4421 .<>We<>21 0.0015 656 49673 768 may<>be<>22 0.0014 458 1286 6340 can<>be<>22 0.0014 507 1887 6340 it<>is<>22 0.0014 883 6873 9969 more<>than<>23 0.0013 388 2125 1793 It<>is<>23 0.0013 577 2184 9969 he<>had<>23 0.0013 670 6799 5096 at<>the<>23 0.0013 1508 4966 62690 .<>When<>23 0.0013 528 49673 580 the<>same<>23 0.0013 595 62690 677 of<>.<>23 0.0013 25 36043 49674 from<>the<>23 0.0013 1352 4190 62690 ------------------------------------------------ Table 2: The NSP output for the top 50 bigrams ranked by the KS module ---------------------------------------------------------------------- 1152569 of<>the<>1 0.0066 9623 36043 62690 .<>The<>2 0.0049 5965 49673 6920 ,<>and<>3 0.0043 6332 58974 27952 in<>the<>4 0.0039 5557 19581 62690 the<>the<>5 0.0030 3 62690 62690 ,<>,<>6 0.0026 61 58974 58974 the<>.<>7 0.0023 15 62690 49674 .<>the<>7 0.0023 6 49673 62690 .<>He<>7 0.0023 2775 49673 2991 .<>,<>8 0.0022 1 49673 58974 ,<>.<>8 0.0022 57 58974 49674 .<>.<>9 0.0019 2 49673 49674 .<>It<>10 0.0017 1999 49673 2184 to<>the<>10 0.0017 3402 25725 62690 on<>the<>10 0.0017 2308 6405 62690 of<>,<>11 0.0016 26 36043 58974 ,<>but<>12 0.0015 1882 58974 3006 .<>I<>13 0.0014 1832 49673 5877 .<>In<>14 0.0013 1598 49673 1736 to<>be<>14 0.0013 1697 25725 6340 of<>.<>14 0.0013 25 36043 49674 ,<>of<>15 0.0012 418 58974 36043 .<>But<>16 0.0011 1298 49673 1371 for<>the<>16 0.0011 1756 8833 62690 at<>the<>16 0.0011 1508 4966 62690 to<>,<>16 0.0011 42 25725 58974 a<>,<>17 0.0010 8 21998 58974 a<>the<>17 0.0010 2 21998 62690 ,<>he<>17 0.0010 1523 58974 6799 from<>the<>17 0.0010 1352 4190 62690 .<>and<>17 0.0010 4 49673 27952 with<>the<>17 0.0010 1477 7007 62690 and<>,<>17 0.0010 304 27952 58974 and<>.<>17 0.0010 3 27952 49674 the<>a<>17 0.0010 1 62690 21998 the<>in<>18 0.0009 5 62690 19581 .<>This<>18 0.0009 1087 49673 1175 to<>.<>18 0.0009 47 25725 49674 by<>the<>18 0.0009 1316 5124 62690 in<>a<>19 0.0008 1318 19581 21998 in<>,<>19 0.0008 73 19581 58974 of<>and<>19 0.0008 9 36043 27952 a<>.<>19 0.0008 15 21998 49674 .<>a<>19 0.0008 4 49673 21998 .<>A<>19 0.0008 1007 49673 1536 of<>a<>20 0.0007 1467 36043 21998 .<>She<>20 0.0007 817 49673 895 as<>a<>20 0.0007 909 6691 21998 and<>of<>20 0.0007 111 27952 36043 it<>is<>20 0.0007 883 6873 9969 ------------------------------------------------------------------------- +------------------------------------------------------------------------+ | Experiment Two | +------------------------------------------------------------------------+ TOP 50 COMPARISON Upon viewing the top 50 rankings for the first time I was rather surprised to see the results I obtained. The rankings were clearly dependent on the first two words in the trigrams, and this produced fairly different ranks for the trigrams. The top 50 rankings produced by the ll3 module are dominated by the ".<>The<>*" trigram (where * is meant to represent any word). The user3 module rankings were dominated by the "of<>the<>*" trigram. The actual output has been included at the end of this section as table(s) 3 & 4. I think this isn't so much a flaw in the modules, but rather a tendency of trigrams. For the ll3 module, it's not much different from the 2-word version and bases it's final value on the sheer number of times the trigrams components occur. It's easy to see from the output that the bigram ".<>The" occurs very often and it has an relatively high count of 49673. In fact the output is essentially the 2-word ranking, but each bigram from the two word has a third word added on in the rankings. Due to space concerns I can not show this in this report, but I have checked it and it is the case. It is the same situation for the user3.pm module. It ranked the words in the same general order as the 2-word version, but appended a third word for each bigram. This seems very strange, but because the calculated values do differ between the trigrams in the ll3 rankings, but the precision isn't large enough to show a big difference in the user3 rankings. CUTOFF POINT Similar to the 2-word case I feel determining a cutoff value for the trigram rankings is very difficult. It seems as though the trigram rankings are merely an expansion on the bigram rankings, so for the same reasons a cutoff value would not mean much. One difference with the trigram rankings however is there starts to emerge a semantic meaning for the trigrams that isn't as prevelant with the bigrams. This is probably due to a trigram being able to convey more meaning with its three words than a bigram. Also this ranking is still based on sheer frequency counts over meaning, so any cutoff falls prey to the wishes of the person assigning it. OVERALL RECOMMENDATION Again, similar to the bigram case in-order to make a recommendation it must be known what the user is looking for. With the absence of the pointwise mutual information module, they are essentially stuck with a ranking based on frequency (and not meaning). The strange correlation between the bigram and trigram cases makes me wonder if there's either an error in my mathematical procedure, or if these two tests do not provide as meaningful information for trigrams as they do bigrams. In either case, I feel they both provide some meaningful information and do find trigrams that occur more frequently than a person might expect. However, if the goal was to find trigrams with a higher semantic meaning, then I think a person would be better off using a different statistic. Table 3: The top 50 rankings produced by the ll3 module -------------------------------------------------------------------------- 1152568 .<>The<>would<>1 32790.1157 2 49673 6920 2675 5965 291 2 .<>The<>same<>2 32774.7950 28 49673 6920 677 5965 32 31 .<>The<>New<>3 32773.2097 14 49673 6920 554 5965 27 29 .<>The<>will<>4 32772.9476 1 49673 6920 2201 5965 196 1 .<>The<>United<>5 32772.5431 14 49673 6920 456 5965 15 17 .<>The<>I<>6 32772.2869 2 49673 6920 5877 5965 341 2 .<>The<>American<>7 32769.1592 18 49673 6920 591 5965 22 21 .<>The<>Man<>8 32768.3984 1 49673 6920 64 5965 1 7 .<>The<>Prince<>9 32763.9745 3 49673 6920 33 5965 3 10 .<>The<>Lord<>10 32762.9355 1 49673 6920 98 5965 2 6 .<>The<>world<>11 32762.5045 11 49673 6920 739 5965 18 13 .<>The<>Commission<>12 32761.6256 3 49673 6920 78 5965 4 9 .<>The<>Walnut<>13 32761.1441 1 49673 6920 8 5965 1 6 .<>The<>addition<>14 32760.9312 1 49673 6920 138 5965 80 1 .<>The<>party<>15 32760.5849 4 49673 6920 207 5965 4 4 .<>The<>Government<>16 32760.4715 10 49673 6920 157 5965 10 15 .<>The<>city<>17 32759.9822 6 49673 6920 290 5965 7 8 .<>The<>National<>18 32759.5559 6 49673 6920 160 5965 6 8 .<>The<>children<>19 32759.2380 10 49673 6920 359 5965 14 10 .<>The<>door<>20 32758.8067 5 49673 6920 321 5965 6 6 .<>The<>Space<>21 32758.6269 1 49673 6920 13 5965 1 5 .<>The<>work<>22 32758.1267 15 49673 6920 758 5965 31 15 .<>The<>great<>23 32757.8663 19 49673 6920 615 5965 29 20 .<>The<>Holy<>24 32757.7263 2 49673 6920 32 5965 2 6 .<>The<>state<>25 32757.5994 9 49673 6920 603 5965 18 13 .<>The<>place<>26 32757.4519 10 49673 6920 545 5965 17 11 .<>The<>higher<>27 32757.1897 3 49673 6920 154 5965 3 3 .<>The<>President<>28 32757.1333 21 49673 6920 290 5965 22 28 .<>The<>sun<>29 32757.0760 7 49673 6920 117 5965 7 7 .<>The<>answer<>30 32756.9679 4 49673 6920 149 5965 10 10 .<>The<>Soviet<>31 32756.8918 4 49673 6920 138 5965 4 5 .<>The<>spirit<>32 32756.7985 3 49673 6920 162 5965 3 4 .<>The<>book<>33 32756.7591 6 49673 6920 184 5965 7 6 .<>The<>present<>34 32756.7460 8 49673 6920 396 5965 17 14 .<>The<>City<>35 32756.6942 4 49673 6920 134 5965 4 5 .<>The<>job<>36 32756.6888 4 49673 6920 240 5965 5 4 .<>The<>Carleton<>37 32756.6658 1 49673 6920 28 5965 1 4 .<>The<>room<>38 32756.5881 6 49673 6920 380 5965 10 6 .<>The<>various<>39 32756.5876 5 49673 6920 195 5965 6 5 .<>The<>very<>40 32756.5023 8 49673 6920 772 5965 20 8 .<>The<>direction<>41 32756.4817 3 49673 6920 132 5965 3 3 .<>The<>total<>42 32756.4491 9 49673 6920 208 5965 10 10 .<>The<>bed<>43 32756.4458 3 49673 6920 131 5965 3 3 .<>The<>water<>44 32756.4428 7 49673 6920 452 5965 13 7 .<>The<>list<>45 32756.4096 3 49673 6920 130 5965 3 3 .<>The<>meaning<>46 32756.2988 3 49673 6920 127 5965 3 3 .<>The<>ball<>47 32756.2627 4 49673 6920 107 5965 4 4 .<>The<>physical<>48 32756.2611 3 49673 6920 126 5965 3 3 .<>The<>audience<>49 32756.1933 8 49673 6920 115 5965 8 8 .<>The<>principle<>50 32756.1886 2 49673 6920 107 5965 2 4 -------------------------------------------------------------------------- Table 4: The top 50 rankings produced by the user3 module -------------------------------------------------------------------------- 1152568 of<>the<>.<>1 0.0067 2 36043 62690 49674 9623 2086 15 of<>the<>next<>1 0.0067 4 36043 62690 362 9623 5 185 of<>the<>end<>1 0.0067 3 36043 62690 417 9623 10 204 of<>the<>only<>1 0.0067 1 36043 62690 1642 9623 11 204 of<>the<>corrosive<>2 0.0066 1 36043 62690 4 9623 1 1 of<>the<>Walkers<>2 0.0066 1 36043 62690 1 9623 1 1 of<>the<>gain<>2 0.0066 3 36043 62690 75 9623 5 6 of<>the<>denominational<>2 0.0066 2 36043 62690 9 9623 4 4 of<>the<>24<>2 0.0066 1 36043 62690 74 9623 1 4 of<>the<>cathedrals<>2 0.0066 1 36043 62690 3 9623 1 2 of<>the<>Lewis<>2 0.0066 1 36043 62690 63 9623 3 3 of<>the<>fortune<>2 0.0066 1 36043 62690 25 9623 3 3 of<>the<>nineteenth<>2 0.0066 14 36043 62690 54 9623 17 31 of<>the<>21<>2 0.0066 5 36043 62690 63 9623 6 8 of<>the<>stall<>2 0.0066 4 36043 62690 17 9623 4 9 of<>the<>pictures<>2 0.0066 3 36043 62690 63 9623 6 9 of<>the<>methods<>2 0.0066 1 36043 62690 131 9623 4 22 of<>the<>20<>2 0.0066 1 36043 62690 148 9623 6 4 of<>the<>destruction<>2 0.0066 2 36043 62690 40 9623 3 12 of<>the<>stag<>2 0.0066 1 36043 62690 8 9623 1 2 of<>the<>destructive<>2 0.0066 2 36043 62690 28 9623 2 4 of<>the<>greater<>2 0.0066 1 36043 62690 180 9623 9 24 of<>the<>muscles<>2 0.0066 1 36043 62690 31 9623 2 5 of<>the<>Force<>2 0.0066 1 36043 62690 31 9623 1 2 of<>the<>basement<>2 0.0066 1 36043 62690 30 9623 1 15 of<>the<>establishment<>2 0.0066 1 36043 62690 52 9623 1 26 of<>the<>stem<>2 0.0066 3 36043 62690 30 9623 3 6 of<>the<>quantum<>2 0.0066 1 36043 62690 5 9623 1 2 of<>the<>17<>2 0.0066 1 36043 62690 53 9623 1 2 of<>the<>Old<>2 0.0066 12 36043 62690 93 9623 13 38 of<>the<>celebration<>2 0.0066 1 36043 62690 14 9623 1 2 of<>the<>regiment<>2 0.0066 2 36043 62690 22 9623 2 10 of<>the<>claim<>2 0.0066 3 36043 62690 97 9623 7 10 of<>the<>views<>2 0.0066 1 36043 62690 49 9623 3 7 of<>the<>13<>2 0.0066 2 36043 62690 65 9623 4 5 of<>the<>12<>2 0.0066 1 36043 62690 146 9623 9 12 of<>the<>David<>2 0.0066 1 36043 62690 55 9623 4 2 of<>the<>mushroom<>2 0.0066 1 36043 62690 2 9623 1 1 of<>the<>white<>2 0.0066 5 36043 62690 273 9623 11 31 of<>the<>11<>2 0.0066 1 36043 62690 96 9623 3 5 of<>the<>10<>2 0.0066 3 36043 62690 251 9623 9 14 of<>the<>Criminal<>2 0.0066 2 36043 62690 4 9623 2 3 of<>the<>unimproved<>2 0.0066 1 36043 62690 2 9623 1 1 of<>the<>vehicle<>2 0.0066 1 36043 62690 33 9623 3 5 of<>the<>unification<>2 0.0066 1 36043 62690 9 9623 1 4 of<>the<>really<>2 0.0066 1 36043 62690 267 9623 5 7 of<>the<>Germans<>2 0.0066 2 36043 62690 27 9623 2 15 of<>the<>Eight<>2 0.0066 1 36043 62690 11 9623 1 1 of<>the<>Iliad<>2 0.0066 6 36043 62690 14 9623 6 13 of<>the<>parent<>2 0.0066 1 36043 62690 17 9623 3 3 -------------------------------------------------------------------------- +-------------------------------------------------------------------------+ | References and More Information | +-------------------------------------------------------------------------+ (1) N-Gram Statistics Package - http://www.d.umn.edu/~tpederse/nsp.html (2) 100 Statistical Tests by Gopal K. Kanji (ISBN: 0-8039-8704-8) - The Kolmogorov-Smirnov test is discussed on page 67 - A table of 'D' values is found on page 186 ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ \\\|/// \\ ~ ~ // ( @ @ ) **********************************-oOOo-(_)-oOOo-************************************ CS 8761 - Natural Language Processing - Dr. Ted Pedersen Assignment 2 : THE GREAT COLLOCATION HUNT Due : Friday , October 11 , Noon Author : Anand Takale ( ) ***************************************-OoooO-*************************************** .oooO ( ) ( ) ) / \ ( (_/ \_) Objective : To investigate various measures of association that can be used to identify --------- collocations in large corpora of text. In particular identify and implement a measure that can be used with 2 and 3 word sequences, and compare this method with some other standard measures.(i.e. To identify a measure that is suitable for both 2 or 3 word sequences that is not already a part of NSP Introduction : For this assignment we have to implement "true" mutual information for 2 word ------------- sequences.Then we have to identify a measure that is not currently supported by the NSP package.This measure should work for 2 as well as 3 word sequences. We also have to implement the log likelihood ratio test for 3 word sequences. After all the tests have been implemented we have to compare the 50 top bigrams produced by each test. Also we have to find out whether any cutoff point for scores occurs in the results. Then we have to compare the tests using the rank.pl module of the NSP package. CORPUS of text used : vikram.txt , crusoe.txt , donq.txt , holmes.txt , hamlet.txt , nafta.txt ------------------- crime.txt , twist.txt , wstra10.txt ( Project Gutenberg - http://promo.net/pg/ ) All the files were placed in a directory called corpus and the tests were run on this directory. ---------------------------------------------------------------------------------------------- Experiment 1 ---------------------------------------------------------------------------------------------- 1. Implement "true" mutual information for 2 word sequences. (tmi.pm) 2. Identify a measure that is currently not supported by NSP that is suitable for discovering 2 and 3 word collocations. (user2.pm) --------------------------------------------------------------------------------------------- Part 1 : Implementing "true" mutual information for 2 word sequences. ( Module tmi.pm ) ------ "True" mutual information is a measurement of dependence between X and Y. "True" mutual information is given by the formula : I(X:Y) = (SUM X)(SUM Y) p(x,y) log [ p(x,y)/ (p(x)*p(y)) ] where p(x,y) is the probability that the word x is the first token of the bigram and word y is the second token of the bigram, p(x) is the probability that x is the first token of the bigram p(y) is the probability that y is the second token of the bigram The following values are obtained as the output of count.pl program. (1) The number of times both x & y occured together with y after x --- f(x,y) (2) The number of times x occured as the first token of the bigram --- f(x) (3) The number of times y occured as the second token of the bigram --- f(y) Once you have the frequencies you can biuld the 2 way contingency table as follows : The 2 way contingency table is as follows : _ y | y | ---------|----------- x | (total x) | ---------|----------- _ | _ x | (total x) | _ (total y) (total y) (total no. of bigrams) We have the values of total no. of bigrams and values for xy , total x and total y directly from output of count.pl .The rest of the values can be calculated and the table can be formulated. After the table has been formulated there are four different values available _ _ _ _ xy , xy . xy , x y We also get the total number of the bigrams that occured in the input text i.e CORPUS so we can find the probabilities as follows : p(x,y) = f(x,y) / totalBigrams p(x) = f(x) / totalBigrams p(y) = f(y) / totalBigrams Taking a summation of all the four values the "TRUE" mutual information for that bigram is calculated. After the "TRUE" mutual information has been calculated for each and every bigram the output of statistic.pl is compared with other outputs of statistic.pl obtained by using different measures of associations. The output files obtained from statistic.pl are compared using rank.pl. Rank.pl will output a floating point number between -1 and 1. A return of '1' indicates a perfect match in rankings, '-1' a completely reversed ranking and '0' a pair of rankings that are completely unrelated to each other. Numbers that lie between these numbers indicate various degrees of relatedness / un-relatedness / reverse-relatedness. Sample Calculation of TMI for a bigram : ---------------------------------------- The sample calculation for the bigram with the highest score in the true mutual information are as follows : 1425725 ,<>and<>1 0.0199 17218 103199 42199 We can see that total number of bigrams occuring in the corpus are 1425725. The bigrams with tokens (,) and (and) has the highest score of 0.0199 We also see that xy = 17218 x = 103199 y = 42199 From these values we can also calculate the other values : Filling the 2*2 table we get: _ y | y | | | x 17218 | 85981 | 103199 | | -------------|----------|-------- _ | | x 24981 | 1297545 | 1322526 | | 42199 | 1383526 | 1425725 From this table we can compute the TRUE mutual information of the bigram as follows The TRUE mutual information is calculated by the following formula : TMI = SUM over x,y ( p(x,y) * log [ p(x,y) / (p(x)*p(y))] ) Now summation over x and y means taking all the four combinations and adding them up. _ _ _ _ xy , xy , xy , x y After running the count.pl on CORPUS the file of bigrams called as corpus.cnt was created. Now the scores of different bigram were calculated by running statistic.pl with tmi.pm embedded into it as the statistic measure. TOP 50 RANKS : ------------- following are the top 15 rankings produced by the tmi.pm statistic. ,<>and<> 1 0.0199 17218 103199 42199 Don<>Quixote<>2 0.0150 2182 2792 2189 ,<>,<> 3 0.0081 1 103199 103199 of<>the<> 4 0.0080 8026 35692 61040 .<>The<> 5 0.0078 2827 49770 3634 ;<>and<> 6 0.0065 4199 14797 42199 .<>He<> 7 0.0062 2136 49770 2533 I<>am<> 8 0.0055 1450 21441 1679 don<>t<> 9 0.0051 747 747 2127 Mr<>.<> 10 0.0050 1467 1469 49770 in<>the<> 11 0.0048 4699 19621 61040 the<>,<> 12 0.0047 1 61040 103199 ;<>but<> 13 0.0042 1597 14797 5276 .<>It<> 14 0.0041 1427 49770 1731 .<>But<> 15 0.0039 1350 49770 1625 We can see that at least all the top 15 rankings have different scores. We can also observe that most of the top ranks produced by the tmi test consisted of punctuation marks as tokens in them. So we can conclude that tmi test doesn't produce interesting collocations. i.e. looking at the top ranked bigrams we cant say anything about the collocations i.e. we have little idea about which two words occur the most. Although the words Don Quixote come together often they are not interesting as they form a name of a person. Also we can see that some collocations are present in the top ranks. Collocations like 'I am' and'in the' occur but they are at a considerable lower rank. CUTOFF POINT : ------------ Looking at the corpus.tmi we can observe that the highest ranked bigrams have unique values. When the ranks go around 30 there are more than one bigrams with the same ranks. Hence we can pinpoint the cutoff point at around ranks 25 to 30. We can observe from this readings that the there are very few bigrams occuring more more of times and many bigrams that appear very few times. This is in accordance with the Zipf's Law. ---------------------------------------------------------------------------------------------------- Part 2 : Identify a measure currently not supported by NSP package. The measure chosen was ------- JT Test. The JT test is a variation of the U test.It is a popular test proposed by Terpstra (1952) and Jonckheere (1954) independent of each other. The JT test uses a Mann-Whitney statistic Uij for the two sample problem comparing samples i and j. The test statistic is constructed by adding all the Uij's . Thus the test statistic is JT coeff = U12 + U13 + .... + U1k + U23 + U24 + ..... + U2k + ... + Uk-1k The Mann Whitney U test based on the idea that the particular pattern exhibited when the X and Y random variables are arranged together in increasing order of magnitude provides information about the relationship between their populations. The Mann Whitney U statistic is defined the number of times a Y precedes an X in the combined ordered arrangement of the two independent random samples. The formula for calculatin the JU statistic is as follows : U = [ P(x,y) - p ] / [ sqrt ( p(1-p)/n ) ] where P(x,y) is the probability that x and y occur together and p = P(x) * P(y) i.e. multiplication of individual properties In the JT test first the 2*2 table was made the samw as in TRUE mutual information module. The U values were calculated from this table and were added to get the JT coefficient. After running the statistic.pl on the corpus with user2.pm as the statistic measure the following bigrams were found to have the highest ranks. Now this test ranks the bigrams on the probability of both the tokens in the bigram occuring together. We saw that in tmi.pm the bigram which occured the most number of times was ranked first. However in this case the bigram with highest rank occured only once. But the JT test gives a coefficient which tells us whether the probability of the two tokens occuring together is more or less. So we see that the probability of the tokens occuring together is 100% in the first two ranks. Plaza<>Mayor<>1 1702365800.3925 1 2 2 ISO<>IEC<>1 1702365800.3925 1 2 2 Also we see that the most of the bigrams with tokens appearing together occur very few times. So again the Zipfs Law holds. Most of the bigrams occur only once whereas only few bigrams occur many times. We also conclude that the two test JT test and TMI are way different from each other. TMI concentrates on most number of occurences while JT concentrates more on probability of tokens in a bigram occuring together. This is also proved by running the two tests with rank.pl. The coefficient obtained is -0.9037 which proves that the TMI and the JT tests are different from each other. RANK COMPARISONS: ---------------- The rank of the tmi test with other tests is as follows : (1) Pointwise Mutual Information (mi.pm) : -0.9031 (2) Log Likelihood (ll.pm) : -0.9024 The rank of the JT Test with other tests is as follows : (1) Pointwise Mutual Information (mi.pm) : 0.6582 (2) Log Likelihood (ll.pm) : 0.3138 OVERALL RECOMMENDATION: We can see that to get interesting collocation the True Mutual INformation test would be more useful if it was run on a corpus without punctuatiojn marks. However in True Mutual information even if a Bigram occured 2 times and the tokens appeared 100 times each and the bigrams appeared 2 times and the tokens appeared 2 times each have the same score. Thus if we want to see which collocation have maximum probablity of appearing together then we should go for the JT test. ----------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------- ------------- EXPERIMENT 2: ------------- 1. Implement a module named ll3.pm that performs the log likelihood ratio test for 3 word sequences. Assume that the null hypothesis of the test is as follows : p(w1).p(w2).p(w3) = p(w1,w2,w3) 2. Create your 3 word version of user2.pm and call it user3.pm In the loglikelihood test for three word sequences there was a need to build the 3 way contingency table first. This contingency table was build using the basic Demorgans law. The formula used for loglikehood test is as follows : log likelihood ratio = 2 * SUM [ observed * log ( observed / estimated ) ] observed = f(xyz) estimated = (f(x)*f(y)*f(z))/(n*n) The count.pl was run on the corpus to produce corpus3.cnt for trigrams. Then statistic.pl was run on both user3.pm as well as ll3.pm The scores obtained were ranked by running rank.pl on corpus.ll3 and corpus.user3 The rank of the user3.pm abd ll3.pm was -0.2637 showing that the two tests are very different from each other. Top 50 ranks : =----------= Again in this case the top 50 ranks were different because the JT test takes into consideration the probability that all the three tokens in the trigram appear together and they do not appear elsewhere.The proof that the tests are very different from each other is proved by the rank coefficient obtained. Cutoff point : ------------ The same cutoff point was observed in the case of user3.pm as that of user2.pm While in the case of loglikelihood there were many unique ranks at the beginnning i.e higher ranks whereas there were many trigrams having same ranks at lower positions. Overall Recommendations : ----------------------- The JT test provides a way to tell ypu about whether the tokens in a n-gram occur only in that particular n-gram and nowhere else in the corpus . However the JT test does not output interesting collocations. ------------ References : ------------ 1. Readme and Newstats file of the NSP package on Dr. Ted Pederson's web page http://www.d.umn.edu/~tpederse 2. CORPUS - taken from Project Gutenberg http://promo.net/pg/ 3. Foundations of Statistical Natural Language Processing - Cristopher Manning and Hinrich Schutze 4. Non Parametric Statistical Inference - Formula for JT test - Jean Dickinson Gibbons , Subhabrata Chakraborti ( Chapters 7 and 11 ) ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++ ***************** Name: Nan Zhang * Date: 10/10/02 * Class: CS8761 * ***************** ****************** * * * Experience 1 * * * ****************** ***************************************************************************************************************************** TOP 50 COMPARISON: ***************************************************************************************************************************** The Corpus I use is <>. Top 50 by tmi.pm .<>D<>31056 -1163809327.4667 310 11718 367 with<>a<>31057 -1203381402.8231 295 2049 4561 him<>,<>31058 -1210841293.8821 263 1470 21495 M<>.<>31059 -1220667819.1382 331 331 11719 by<>the<>31060 -1224848032.7068 290 1055 12566 then<>,<>31061 -1263433879.9635 299 639 21495 me<>,<>31062 -1318917540.1362 291 1356 21495 said<>the<>31063 -1336223882.2826 300 1960 12566 on<>the<>31064 -1344955969.0981 325 952 12566 I<>am<>31065 -1363879681.2058 417 3849 440 ,<>to<>31066 -1374335186.5698 265 21495 6501 ,<>which<>31067 -1380125280.8556 303 21495 1495 you<>,<>31068 -1396187976.1075 285 3370 21495 he<>had<>31069 -1398364320.9748 374 2689 1856 with<>the<>31070 -1398432295.5183 314 2049 12566 .<>But<>31071 -1420887734.1110 380 11718 433 .<>You<>31072 -1436573948.3186 376 11718 527 .<>And<>31073 -1471446439.4109 389 11718 495 ,<>then<>31074 -1506154159.7267 363 21495 639 to<>be<>31075 -1542982342.1917 391 6501 1359 ,<>with<>31076 -1545714947.3050 333 21495 2049 of<>a<>31077 -1551260754.0983 348 6321 4561 Athos<>,<>31078 -1556033698.3020 361 957 21495 I<>have<>31079 -1584173054.4934 424 3849 1460 of<>his<>31080 -1625082599.1097 383 6321 2910 the<>cardinal<>31081 -1673144878.2925 443 12566 518 ,<>but<>31082 -1732749001.5084 397 21495 1208 .<>He<>31083 -1745646036.6340 468 11718 521 and<>the<>31084 -1754961703.7808 368 5350 12566 ,<>you<>31085 -1800977181.3863 376 21495 3370 ,<>in<>31086 -1847508856.0975 389 21495 3146 ,<>that<>31087 -1851306368.0374 389 21495 3225 ;<>and<>31088 -1861098232.2333 454 2829 5350 ,<>my<>31089 -1909670559.2590 435 21495 1412 ;<>but<>31090 -1994045442.5270 584 2829 1208 ,<>as<>31091 -2037463998.2530 462 21495 1578 at<>the<>31092 -2046655568.0214 493 1493 12566 ,<>who<>31093 -2144662094.9485 509 21495 1055 ,<>he<>31094 -2290124564.3685 499 21495 2688 Artagnan<>,<>31095 -2515455299.3788 574 1827 21495 ,<>the<>31096 -2889357459.5828 561 21495 12566 .<>I<>31097 -2927902472.6048 669 11718 3849 .<>The<>31098 -3213055003.6713 859 11718 982 in<>the<>31099 -3368643268.6223 791 3146 12566 to<>the<>31100 -3415678714.8315 748 6501 12566 ,<>I<>31101 -3735454005.1049 824 21495 3849 d<>Artagnan<>31102 -4344794727.8653 1466 1494 1827 ,<>said<>31103 -5114710934.5145 1246 21495 1960 of<>the<>31104 -6867623309.2519 1615 6321 12566 ,<>and<>31105 -12470406843.4868 3002 21495 5350 Top 50 by user2.pm !<>a<>10700 -0.9402 1 1896 4561 have<>.<>10701 -0.9410 2 1460 11719 and<>,<>10702 -0.9417 14 5350 21495 ,<>s<>10703 -0.9420 2 21495 783 you<>he<>10704 -0.9427 1 3370 2688 his<>in<>10705 -0.9433 1 2910 3146 of<>at<>10706 -0.9453 1 6321 1493 you<>his<>10707 -0.9470 1 3370 2910 ,<>But<>10708 -0.9474 1 21495 433 had<>and<>10709 -0.9479 1 1856 5350 of<>in<>10710 -0.9483 2 6321 3146 for<>to<>10711 -0.9501 1 1593 6501 to<>for<>10711 -0.9501 1 6501 1593 from<>.<>10712 -0.9505 1 877 11719 ?<>and<>10713 -0.9506 1 1960 5350 ,<>know<>10714 -0.9509 1 21495 465 with<>and<>10715 -0.9527 1 2049 5350 of<>.<>10716 -0.9529 7 6321 11719 at<>,<>10717 -0.9543 3 1493 21495 or<>,<>10718 -0.9549 1 507 21495 ,<>He<>10719 -0.9561 1 21495 521 in<>I<>10720 -0.9570 1 3146 3849 !<>the<>10721 -0.9575 2 1896 12566 .<>will<>10722 -0.9583 1 11718 1043 by<>.<>10723 -0.9588 1 1055 11719 his<>,<>10724 -0.9610 5 2910 21495 to<>with<>10725 -0.9611 1 6501 2049 we<>,<>10726 -0.9636 1 631 21495 in<>and<>10727 -0.9691 1 3146 5350 .<>which<>10728 -0.9708 1 11718 1495 of<>;<>10729 -0.9709 1 6321 2829 to<>.<>10730 -0.9736 4 6501 11719 from<>,<>10731 -0.9737 1 877 21495 of<>,<>10732 -0.9750 7 6321 21495 of<>to<>10732 -0.9750 2 6321 6501 ,<>!<>10733 -0.9758 2 21495 1896 had<>.<>10734 -0.9764 1 1856 11719 .<>said<>10735 -0.9777 1 11718 1960 .<>?<>10735 -0.9777 1 11718 1960 to<>,<>10736 -0.9792 6 6501 21495 the<>said<>10736 -0.9792 1 12566 1960 ,<>me<>10737 -0.9830 1 21495 1356 my<>,<>10738 -0.9837 1 1412 21495 .<>and<>10739 -0.9838 2 11718 5350 ,<>him<>10740 -0.9843 1 21495 1470 .<>that<>10741 -0.9864 1 11718 3225 ,<>?<>10742 -0.9882 1 21495 1960 .<>of<>10743 -0.9931 1 11718 6321 ,<>.<>10744 -0.9981 1 21495 11719 .<>,<>10744 -0.9981 1 11718 21495 The output of tmi.pm is just as what I expect. The association is meaningful and theirs ranks are almost the same as that in the output of count.pl. The output of user2.pm is like that of mi.pm very much. But the rank is to my surprise since the first-ranked is ".<>,<>" and something that are meaningless rank highly. But I found that in the output of count.pl, rank of them are also very high. So I think it is the problem of the flaw of the measure instead of the program. So, since tmi.pm seems better at identifying significant and interesting collocations, I think the measure of tmi is better than user2. ***************************************************************************************************************************** CUTOFF POINT: ***************************************************************************************************************************** I don't think I can find obvious cutoff points in the lists of bigrams as ranked by NSP for both tmi.pm and user2.pm, because the scores of both lists are distributed continuously instead of discretely. So the curve of score-rank should be smooth. And that means the the collocations of large corpus are distributed dispersedly instead of concentratedly. ***************************************************************************************************************************** RANK COMPARISON: ***************************************************************************************************************************** Judged from the top 50 comparision, tmi.pm is more like ll.pm, and user2.pm is more like mi.pm. What I observed are: perl rank.pl output2.ll output2.tmi Rank correlation coefficient = 0.1800 perl rank.pl output2.ll output2.user2 Rank correlation coefficient = 0.6996 perl rank.pl output2.mi output2.tmi Rank correlation coefficient = 0.7158 perl rank.pl output2.mi output2.user2 Rank correlation coefficient = 0.9204 It seems that user2.pm is more like both ll.pm and mi.pm than tmi.pm. The rank correlation coefficient between ll.pm and tmi.pm is very low because tmi.pm is not like ll.pm when the rank grows large. I think this is because that though they are similar when p(x,y) is large, when p(x,y) becomes very small, p(x,y)log(p(x,y)/p(x)p(y)) is much smaller than p(x,y). ***************************************************************************************************************************** OVERALL RECOMMENDATION: ***************************************************************************************************************************** I think tmi.pm is the best for identifying significant collocations in large corpora of text. The modules of mi.pm and user2.pm don't work well when p(x,y) is large, and they both rank some meaningless association highly. Though tmi.pm and ll.pm perform similarly when p(x,y) is large, when p(x,y) becomes very small, since p(x,y)log(p(x,y)/p(x)p(y)) grows smaller more quickly than p(x,y), which embodies the distribution of collocations in large corpora, tmi.pm perform better than ll.pm. So tmi.pm is the best for identifying significant collocations in large corpora of text. ***************************************************************************************************************************** ****************** * * * Experience 2 * * * ****************** ***************************************************************************************************************************** TOP 50 RANK: ***************************************************************************************************************************** The Corpus I use is <>. Top 50 by ll3.pm said<>d<>Artagnan<>1 -6812109.6494 322 1960 1494 1827 322 322 1466 .<>d<>Artagnan<>2 -6812147.0776 61 11718 1494 1827 64 371 1466 d<>Artagnan<>,<>3 -6812880.4280 523 1494 1827 21495 1466 531 574 cried<>d<>Artagnan<>4 -6813309.2328 91 447 1494 1827 91 91 1466 replied<>d<>Artagnan<>5 -6813411.5904 74 378 1494 1827 74 74 1466 d<>Artagnan<>.<>6 -6813426.2759 252 1494 1827 11719 1466 258 252 Monsieur<>d<>Artagnan<>7 -6813565.4285 56 454 1494 1827 56 56 1466 d<>Artagnan<>;<>8 -6813615.2360 95 1494 1827 2829 1466 96 95 the<>d<>Artagnan<>9 -6813664.3558 2 12566 1494 1827 3 2 1466 asked<>d<>Artagnan<>10 -6813687.7451 28 208 1494 1827 28 28 1466 d<>Artagnan<>the<>11 -6813694.4313 8 1494 1827 12566 1466 8 8 d<>Artagnan<>s<>12 -6813703.9910 39 1494 1827 783 1466 40 43 d<>Artagnan<>was<>13 -6813709.5038 49 1494 1827 2511 1466 50 72 d<>Artagnan<>had<>14 -6813714.3382 49 1494 1827 1856 1466 50 62 !<>d<>Artagnan<>15 -6813737.2058 2 1896 1494 1827 2 27 1466 ,<>d<>Artagnan<>16 -6813737.6726 161 21495 1494 1827 161 161 1466 dear<>d<>Artagnan<>17 -6813741.0051 19 188 1494 1827 19 19 1466 d<>Artagnan<>took<>18 -6813744.0007 8 1494 1827 234 1466 8 19 d<>Artagnan<>of<>19 -6813747.2098 2 1494 1827 6321 1466 2 2 murmured<>d<>Artagnan<>20 -6813747.3751 13 65 1494 1827 13 13 1466 d<>Artagnan<>related<>21 -6813753.6881 2 1494 1827 34 1466 2 9 a<>d<>Artagnan<>22 -6813764.7120 1 4561 1494 1827 1 1 1466 demanded<>d<>Artagnan<>23 -6813770.4532 8 31 1494 1827 8 8 1466 d<>Artagnan<>remained<>24 -6813773.4664 2 1494 1827 97 1466 2 9 d<>Artagnan<>looked<>25 -6813773.8246 2 1494 1827 99 1466 2 9 d<>Artagnan<>went<>26 -6813773.9592 9 1494 1827 187 1466 9 14 d<>Artagnan<>felt<>27 -6813774.6053 7 1494 1827 97 1466 7 11 d<>Artagnan<>blushed<>28 -6813776.7168 1 1494 1827 12 1466 1 5 continued<>d<>Artagnan<>29 -6813778.1083 12 170 1494 1827 12 12 1466 d<>Artagnan<>a<>30 -6813778.9446 4 1494 1827 4561 1466 4 4 of<>d<>Artagnan<>31 -6813780.3571 63 6321 1494 1827 63 63 1466 d<>Artagnan<>bowed<>32 -6813781.2949 1 1494 1827 41 1466 1 6 d<>Artagnan<>did<>33 -6813783.5557 8 1494 1827 365 1466 8 15 to<>d<>Artagnan<>34 -6813785.0053 60 6501 1494 1827 60 60 1466 that<>d<>Artagnan<>35 -6813785.9951 41 3225 1494 1827 41 41 1466 d<>Artagnan<>drew<>36 -6813786.6299 2 1494 1827 80 1466 2 7 ?<>d<>Artagnan<>37 -6813786.6915 5 1960 1494 1827 5 20 1466 in<>d<>Artagnan<>38 -6813788.4593 2 3146 1494 1827 2 2 1466 d<>Artagnan<>threw<>39 -6813791.3619 1 1494 1827 45 1466 1 5 which<>d<>Artagnan<>40 -6813791.5723 25 1495 1494 1827 25 25 1466 d<>Artagnan<>thought<>41 -6813792.0459 3 1494 1827 146 1466 3 8 interrupted<>d<>Artagnan<>42 -6813792.7101 5 28 1494 1827 5 5 1466 and<>d<>Artagnan<>43 -6813795.5949 45 5350 1494 1827 45 45 1466 d<>Artagnan<>followed<>44 -6813798.6235 3 1494 1827 86 1466 3 6 d<>Artagnan<>uttered<>45 -6813799.1809 1 1494 1827 42 1466 1 4 d<>Artagnan<>believed<>46 -6813799.3480 2 1494 1827 64 1466 2 5 d<>Artagnan<>ran<>47 -6813799.3650 1 1494 1827 43 1466 1 4 d<>Artagnan<>could<>48 -6813799.6802 7 1494 1827 337 1466 7 11 d<>Artagnan<>to<>49 -6813799.7664 31 1494 1827 6501 1466 31 31 d<>Artagnan<>turned<>50 -6813799.7933 4 1494 1827 79 1466 4 6 thought<>d<>Artagnan<>51 -6813800.1218 7 146 1494 1827 7 7 1466 Top 50 by user3.pm to<>,<>my<>12471 -0.8713 1 6501 21495 1412 6 36 435 .<>She<>,<>12471 -0.8713 1 11718 207 21495 176 1751 1 ,<>to<>her<>12472 -0.8714 1 21495 6501 1491 265 109 122 ,<>so<>,<>12473 -0.8732 1 21495 633 21495 70 1687 62 ,<>you<>,<>12474 -0.8756 5 21495 3370 21495 376 1687 285 Artagnan<>,<>of<>12475 -0.8793 1 1827 21495 6321 574 7 83 ,<>?<>said<>12476 -0.8810 1 21495 1960 1960 1 43 180 ,<>and<>me<>12477 -0.8853 1 21495 5350 1356 3002 89 5 .<>This<>,<>12478 -0.8858 1 11718 233 21495 200 1751 1 ,<>on<>,<>12479 -0.8893 1 21495 952 21495 111 1687 20 d<>Artagnan<>of<>12480 -0.8918 2 1494 1827 6321 1466 2 2 was<>,<>the<>12481 -0.8934 1 2511 21495 12566 62 105 561 have<>,<>and<>12482 -0.8941 1 1460 21495 5350 21 5 3002 at<>,<>and<>12483 -0.8966 1 1493 21495 5350 3 5 3002 I<>have<>,<>12484 -0.8985 2 3849 1460 21495 424 168 21 ,<>I<>and<>12485 -0.8996 1 21495 3849 5350 824 72 3 he<>of<>the<>12486 -0.8999 1 2689 6321 12566 2 148 1615 ?<>I<>,<>12487 -0.9001 1 1960 3849 21495 167 298 68 ,<>had<>the<>12488 -0.9004 1 21495 1856 12566 110 1401 49 to<>be<>.<>12489 -0.9008 1 6501 1359 11719 391 463 8 and<>that<>,<>12490 -0.9020 1 5350 3225 21495 170 295 56 you<>,<>to<>12491 -0.9033 1 3370 21495 6501 285 108 265 of<>to<>the<>12492 -0.9054 1 6321 6501 12566 2 55 748 ,<>you<>.<>12493 -0.9066 2 21495 3370 11719 376 303 226 ;<>I<>,<>12494 -0.9075 1 2829 3849 21495 211 116 68 I<>,<>to<>12495 -0.9086 1 3849 21495 6501 68 175 265 ,<>Yes<>,<>12496 -0.9091 1 21495 314 21495 1 1687 240 to<>his<>,<>12497 -0.9112 1 6501 2910 21495 175 572 5 ,<>and<>is<>12498 -0.9120 1 21495 5350 1850 3002 310 3 had<>,<>and<>12499 -0.9161 1 1856 21495 5350 21 7 3002 and<>to<>the<>12500 -0.9175 1 5350 6501 12566 45 268 748 said<>,<>and<>12501 -0.9189 1 1960 21495 5350 118 4 3002 I<>,<>and<>12502 -0.9193 2 3849 21495 5350 68 13 3002 .<>It<>,<>12503 -0.9212 1 11718 362 21495 293 1751 1 a<>d<>Artagnan<>12504 -0.9233 1 4561 1494 1827 1 1 1466 ,<>to<>you<>12505 -0.9286 1 21495 6501 3370 265 373 171 ,<>and<>,<>12506 -0.9301 10 21495 5350 21495 3002 1687 14 ,<>.<>D<>12507 -0.9305 1 21495 11718 367 1 2 310 ,<>that<>,<>12508 -0.9339 2 21495 3225 21495 389 1687 56 to<>,<>said<>12509 -0.9417 1 6501 21495 1960 6 3 1246 ,<>said<>to<>12510 -0.9427 1 21495 1960 6501 1246 295 20 the<>d<>Artagnan<>12511 -0.9453 2 12566 1494 1827 3 2 1466 of<>his<>,<>12512 -0.9503 1 6321 2910 21495 383 612 5 ,<>of<>his<>12513 -0.9524 1 21495 6321 2910 83 293 383 his<>,<>and<>12514 -0.9534 1 2910 21495 5350 5 104 3002 of<>,<>and<>12515 -0.9537 2 6321 21495 5350 7 100 3002 ,<>I<>.<>12516 -0.9556 1 21495 3849 11719 824 303 8 ,<>I<>,<>12517 -0.9601 2 21495 3849 21495 824 1687 68 and<>,<>and<>12518 -0.9706 1 5350 21495 5350 14 19 3002 ,<>he<>,<>12519 -0.9730 1 21495 2688 21495 499 1687 190 The top 50 of the output of ll3.pm is constructed by many trigrams that contain one name in the novel, but all these trigrams are meaningful and they all rank highly in the output of count.pl. The top 50 of the output of user3.pm is constructed by many trigrams that don't rank highly in the output of count.pl. So, since ll3.pm seems better at identifying significant and interesting collocations, I think the measure of ll3 is better than user3. ***************************************************************************************************************************** CUTOFF POINT: ***************************************************************************************************************************** Also, I don't think I can find obvious cutoff points in the lists of bigrams as ranked by NSP for both ll3.pm and user3.pm, because the scores of both lists are distributed continuously instead of discretely. So the curve of score-rank should be smooth. And that means the the collocations of large corpus are distributed dispersedly instead of concentrately. ***************************************************************************************************************************** OVERALL RECOMMENDATION: ***************************************************************************************************************************** I think ll3.pm is better for identifying significant collocations in large corpora of text. The trigrams in the output by ll3.pm are all meaningful and rank highly in the output of count.pl. Though many top-ranked trigrams contain one name, I think that's the problem of the special novel.