******************************************************
* CS8761 - Assignment 2 - The Great Collocation Hunt *
* Author: Amine AbouRjeili                           *
* Date: 10/09/2002                                   *
* Filename: experiments.txt                          *
******************************************************
Objective

To investigate the effectiveness of different measures of association in identifying collocations
and compare their performance. All measures are to be implemented as NSP (v0.51) modules.
The True Mutual Information (tmi.pm) for bigrams and the Log Likelihood Ratio for trigrams were to be
implemented and compared with an additional measure that can be used with both bigram and trigram
collocations. After extensive research using online resources and library books, I decided to use
the t-score association measure. This is a statistical test that tells us how probable or improbable
it is that a certain event will occur (Manning and Schutze). It is defined as:
t = (sample_mean - distribution_mean) / sqrt(sample_variance / sample_size)
This test assumes that the sample is drawn from a normal distribution, and it determines how
probable it is to draw a sample with this mean and variance. It compares the sample mean with the
distribution mean, scaled by the sample variance. In these experiments the NULL hypothesis is:
H(0): P(W1 W2 ... Wn) = P(W1) * P(W2) * ... * P(Wn)
In other words, we assume that the words are independent and that the occurrence of N-word
collocations is merely by chance. Based on the t-score for a particular collocation we can reject
the NULL hypothesis at a certain confidence level, or fail to reject it if the score does not reach
the required confidence level. The distribution mean is taken to be the product of the probabilities
of the individual words of the collocation [P(W1)*P(W2)*...*P(Wn)], since we assume that they are
independent (NULL hypothesis).
I chose the t-test because of its simplicity of implementation. I wanted to compare its performance
with other association measures that are based on actual counts of events rather than on an estimated
sample mean and variance, and to see how well it performs compared to the more complex measures of
association. I believe that it will not perform as well as the true mutual information, because it
considers mainly the joint probability of the words of the collocation, whereas the true mutual
information test also considers the cases where the individual words occur separately, in addition
to the joint occurrences. Another reason I chose this test is that it extends easily to trigrams.
The NULL hypothesis becomes
P(W1 W2 W3) = P(W1) * P(W2) * P(W3)
and so the distribution mean equals P(W1) * P(W2) * P(W3). This extension is straightforward
to understand and to implement (see Experiment 2).
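The bigram t-score described above can be sketched as follows. This is a minimal illustration of the Manning and Schutze formulation, not the actual user2.pm code; the variance is approximated by the sample mean, so its values need not reproduce the module's scores exactly.

```python
import math

def t_score(n11, n1p, np1, npp):
    """t-score for a bigram under H0: P(w1 w2) = P(w1) * P(w2).

    n11: joint count of the bigram, n1p: count of word 1,
    np1: count of word 2, npp: total number of bigrams.
    """
    sample_mean = n11 / npp                # observed bigram probability
    dist_mean = (n1p / npp) * (np1 / npp)  # expected under independence
    # For rare events the Bernoulli variance p*(1 - p) is close to p.
    variance = sample_mean
    return (sample_mean - dist_mean) / math.sqrt(variance / npp)

# Counts for "Project<>Gutenberg" from the tmi column in Experiment 1:
# joint 322, c(Project) 408, c(Gutenberg) 322, N = 1206150
print(round(t_score(322, 408, 322, 1206150), 2))
```

Because the expected count under independence is tiny here, the score is dominated by the observed count, which is why strongly associated pairs rise to the top.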
I created two corpora of about 1,000,000 words by joining a number of books from Project Gutenberg as
follows:
CORPUS 1: # TOKENS 1083129
Don Quixote by Miguel de Cervantes
Crime and Punishment by Fyodor Dostoevsky
Robinson Crusoe by Daniel Defoe
The Adventures of Sherlock Holmes by Arthur Conan Doyle
Hamlet by William Shakespeare
CORPUS 2: # TOKENS 1037882
==> bndlt10.txt <==
The Project Gutenberg Etext of A Bundle of Letters by Henry James
==> frnrd10.txt <==
The Project Gutenberg etext of The Friendly Road; New Adventures in
Contentment by David Grayson (pseudonym of Ray Stannard Baker)
==> ktysc10.txt <==
The Project Gutenberg EBook of What Katy Did At School, by Susan Coolidge
==> lchch10.txt <==
The Project Gutenberg EBook of Chess and Checkers: The Way to Mastership
by Edward Lasker
==> lstfc10.txt <==
The Project Gutenberg Etext of Lost Face, by Jack London
==> mrswf10.txt <==
The Project Gutenberg Etext of The Mayor's Wife, by Anna Katherine Green
==> outpo10.txt <==
The Project Gutenberg Etext of Outpost by J.G. Austin
==> rbcru10.txt <==
The Project Gutenberg Etext of Robinson Crusoe, by Daniel Defoe
==> rdst10.txt <==
The Project Gutenberg EBook of Rodney Stone, by Arthur Conan Doyle
==> stjlp10.txt <==
The Project Gutenberg EBook of The Story Of Julia Page, by Kathleen Norris
==> strlc10.txt <==
The Project Gutenberg Etext of The Story Of Electricity, by John Munro
==> trbny10.txt <==
The Project Gutenberg EBook of The Rover Boys in New York, by Arthur M. Winfield
==> txsrn10.txt <==
The Project Gutenberg EBook of A Texas Ranger, by William MacLeod Raine
==> wstys11.txt <==
The Project Gutenberg Etext of Under Western Eyes, by Joseph Conrad
**********************************************
*****************Experiment 1*****************
**********************************************
TOP 50 COMPARISON:

Using CORPUS2:
Bigrams

count.pl CORPUS_2N.cnt CORPUS2
statistics.pl tmi CORPUS_2N.tmi CORPUS_2N.cnt
Top 26 entries shown for each test
tmi user2 (tscore)
 
1206150 1206150
,<>and<>1 0.0232 14223 79083 31336 Todos<>los<>1 1098.2477 1 1 1
.<>The<>2 0.0109 3436 53260 4038 COINCIDENCES<>SUGGESTING<>1 1098.2477 1 1 1
don<>t<>3 0.0087 1256 1257 4248 Fops<>Alley<>1 1098.2477 1 1 1
of<>the<>4 0.0084 6644 26941 52108 laisser<>aller<>1 1098.2477 1 1 1
.<>He<>5 0.0082 2503 53260 2834 Scots<>Magazine<>1 1098.2477 1 1 1
.<>It<>6 0.0075 2331 53260 2686 superseded<>manual<>1 1098.2477 1 1 1
,<>,<>7 0.0066 2 79083 79083 los<>Santos<>1 1098.2477 1 1 1
in<>the<>8 0.0065 4523 15673 52108 _ce<>miserable_<>1 1098.2477 1 1 1
.<>But<>9 0.0059 1801 53260 2055 ga<>irls<>1 1098.2477 1 1 1
.<>I<>10 0.0058 5149 53260 22711 Ari<>Frode<>1 1098.2477 1 1 1
,<>but<>11 0.0052 2690 79083 4665 CAVE<>RETREAT<>1 1098.2477 1 1 1
.<>She<>12 0.0050 1560 53260 1806 strutting<>cockerel<>1 1098.2477 1 1 1
,<>.<>13 0.0044 1 79083 53260 Sallie<>Laundon<>1 1098.2477 1 1 1
the<>,<>14 0.0043 2 52108 79083 DARTAWAY<>Quit<>1 1098.2477 1 1 1
had<>been<>15 0.0042 1013 7956 2446 quel<>nom<>1 1098.2477 1 1 1
Mrs<>.<>16 0.0041 1103 1105 53260 personne<>indiquee_<>1 1098.2477 1 1 1
;<>and<>17 0.0039 2020 7387 31336 GIMLET<>BUTTE<>1 1098.2477 1 1 1
.<>And<>17 0.0039 1341 53260 1735 INTERFERENCE<>Struck<>1 1098.2477 1 1 1
.<>You<>18 0.0038 1298 53260 1704 Highland<>Fling<>1 1098.2477 1 1 1
Mr<>.<>18 0.0038 1020 1029 53260 Humphry<>Gunther<>1 1098.2477 1 1 1
did<>not<>18 0.0038 790 1643 5777 WHICH<>CHRISTIAN<>1 1098.2477 1 1 1
.<>,<>19 0.0037 133 53260 79083 Andrew<>Gamble<>1 1098.2477 1 1 1
;<>but<>20 0.0034 1026 7387 4665 TAMES<>GOATS<>1 1098.2477 1 1 1
to<>be<>20 0.0034 1532 27462 5008 Ferned<>grot<>1 1098.2477 1 1 1
I<>am<>20 0.0034 823 22711 968 MOST<>SUDDEN<>1 1098.2477 1 1 1
Project<>Gutenberg<>21 0.0033 322 408 322 azure<>firmament<>1 1098.2477 1 1 1
Considering the top 50 entries in each test, it is clear that the t-score gave better results than
the tmi. Most of the top entries of the tmi test include punctuation marks and stop words, which are
not of much interest. One interesting collocation identified by the tmi measure is
"Project<>Gutenberg<>".
On the other hand, the t-score test gives more meaningful words
in its collocations and identifies some good ones such as "Intellectual<>Peach<>", which
most probably has some meaning as a unit of words and does not literally mean an 'intellectual peach'.
Another good collocation identified by the t-score test is "Padre<>Johannes<>", which is of the
proper-noun form, and another interesting one is "Rural<>Taste<>". For most of the entries in the
t-score table we can safely reject the NULL hypothesis at a confidence level of 99.5%. On the other
hand, there is no comparable confidence measure for the true mutual information. From the
current experiments, one might be tempted to conclude that the t-score is significantly better
at identifying interesting collocations than the true mutual information. However, I would point
out that the t-score measure might be overclassifying entries; in other words, it may treat most
bigrams as collocations when they are not. There might be many false positives in the data, and from
the sample output above it can be seen that some of the top bigrams belong to this class. The same
may be true of the true mutual information measure. Here I would suggest the use of some ratio that
measures the accuracy of each test, i.e. one that tells us how many of the reported bigrams are
false positives and how many are actually true collocations.
However, if we disregard the issue of false positives, I can conclude from the above data that the
t-score performed better than the true mutual information test in this experiment. One reason for
this difference can be attributed to the expected frequencies of the individual tokens of a bigram.
For example, punctuation marks appear very many times in the text, and this affects the tmi test,
since it is based on the expected and observed frequencies of every cell. The t-score test, by
contrast, does not consider all the cells of the contingency table, as discussed above.
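The all-cells behaviour of the true mutual information can be sketched as follows. This is a simplified illustration, not the tmi.pm source, with the counts taken from the ",<>and" row of the tmi table above.

```python
import math

def true_mi(n11, n1p, np1, npp):
    """True mutual information over all four cells of the 2x2 table.

    Unlike the t-score, which uses only the joint-occurrence cell,
    this sums p * log2(p / expected) for every cell.
    """
    rows = (n1p, npp - n1p)          # word 1 present / absent
    cols = (np1, npp - np1)          # word 2 present / absent
    cells = ((n11, 0, 0),                      # both words
             (n1p - n11, 0, 1),                # word 1 without word 2
             (np1 - n11, 1, 0),                # word 2 without word 1
             (npp - n1p - np1 + n11, 1, 1))    # neither word
    mi = 0.0
    for obs, r, c in cells:
        if obs > 0:
            p = obs / npp
            expected = (rows[r] / npp) * (cols[c] / npp)
            mi += p * math.log2(p / expected)
    return mi

# Counts for ",<>and" from the tmi column above:
# joint 14223, c(,) 79083, c(and) 31336, N = 1206150
print(round(true_mi(14223, 79083, 31336, 1206150), 4))  # -> 0.0232
```

The high-frequency "neither word" cell contributes substantially, which is why very frequent tokens such as punctuation dominate the tmi rankings.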
CUTOFF POINT:

User2.pm:
For this measure, a cutoff seems to exist approximately around rank 150 (+/-20), where more and more
bigrams stop looking like good collocations. After about rank 120, at least a third of the bigrams
do not appear to be collocations. Since the numerator of this test is the difference between the
sample mean and the distribution mean, and the scores decrease as we go down the list, the numerator
must be shrinking faster than the denominator. This indicates that the sample mean and the
distribution mean are converging to a common value. Since our NULL hypothesis says that the tokens
of the sequence are independent of each other, and the distribution mean is computed under that
assumption, a sample mean converging towards the distribution mean means that the word sequences
become more and more independent as we move down the list.
tmi.pm:
For this measure, there is a cutoff around rank 50. Most of the bigrams in the top positions are not
good collocations and consist of punctuation marks and stop words. However, after about rank 50 we
start to see fewer punctuation marks and more word bigrams, and we begin to observe some interesting
collocations such as "Web<>sites<>", which occurs at rank 50. The user2.pm measure, on the other
hand, ranks this collocation at 120, a big difference between the two rankings. This difference is
due to the different counts that each measure considers: one frequency count that the true mutual
information uses but the t-score does not is the count of the cases where neither word occurred,
and I suggest that this contributes to the difference.
RANK COMPARISON

Comparing ll with tmi and user2:

*****************************************************
csdev22(52):>./rankscript.sh ll tmi CORPUS_2N.cnt
Rank correlation coefficient = 0.8878
csdev22(53):>./rankscript.sh ll user2 CORPUS_2N.cnt
Rank correlation coefficient = 0.8745
*****************************************************
Comparing mi with tmi and user2:

*****************************************************
csdev18(15):>./rankscript.sh mi tmi CORPUS_2N.cnt
Rank correlation coefficient = 0.8889
csdev18(16):>./rankscript.sh mi user2 CORPUS_2N.cnt
Rank correlation coefficient = 0.9516
*****************************************************
From the above results it can be observed that user2.pm (t-test) is very closely related to both
the mutual information and the log likelihood ratio; indeed, it is a near-perfect match to mutual
information. The true mutual information test, on the other hand, correlates with the mutual
information and the log likelihood tests to approximately the same, somewhat lower, degree. One
reason for this is that the true mutual information test accounts for all the cells of the
contingency table, while the mutual information is only concerned with one of them, so the
frequency counts influence the two measures differently. The same holds for user2.pm, which
considers only certain cells and not all of them.
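The rank correlation reported by rankscript.sh is presumably Spearman's coefficient; a minimal sketch follows. This is a hypothetical helper for illustration, not the rank.pl source, and it skips the tied-rank averaging that a full implementation needs.

```python
def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score lists over the same
    bigrams. Simplified: ties are ranked by position, not averaged."""
    n = len(scores_a)
    def ranks(scores):
        # Rank 1 = highest score, matching the NSP ranking convention.
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two measures that order five bigrams almost identically score near 1:
print(spearman([5.0, 4.0, 3.0, 2.0, 1.0], [9.0, 8.0, 6.0, 7.0, 5.0]))  # -> 0.9
```

A coefficient near 1 (as for user2 vs. mi above) means the two measures order the bigrams almost identically, even if their raw scores are on completely different scales.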
OVERALL RECOMMENDATION

Based on the experiments carried out, I would recommend the user2, mi and ll measures of
association, since they behave very similarly. I reached this conclusion from the results of
user2.pm: the rank.pl program showed that it is very similar to ll and mi, so the three should
produce approximately the same rankings. The output of tmi was not very good in terms of its top
entries, which were not very meaningful and seemed to be ordinary bigrams with no association.
I also noticed that the user2.pm (t-test) converges to the z-test when the degrees of freedom are
large, as in these experiments; this can also be seen from the t-distribution confidence interval
values.
**********************************************
*****************Experiment 2*****************
**********************************************
This experiment involves collocations composed of 3-word sequences (trigrams). The first test to
implement is the log likelihood test, modified to consider trigrams.
The second test is the t-test measure (user3.pm), extended to accommodate three variables instead
of two; as a result it can handle 2x2x2 contingency tables.
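The trigram extension of the t-score can be sketched in the same way as the bigram case. This is an illustrative formulation, not the user3.pm source; the module's scores in the table below are clearly on a different scale, so it evidently normalizes differently.

```python
import math

def t_score_trigram(n111, n1, n2, n3, npp):
    """t-score for a trigram under H0: P(w1 w2 w3) = P(w1)P(w2)P(w3).

    n111: trigram count, n1/n2/n3: individual word counts,
    npp: total number of trigrams.
    """
    sample_mean = n111 / npp
    dist_mean = (n1 / npp) * (n2 / npp) * (n3 / npp)
    variance = sample_mean          # p*(1 - p) ~ p for rare events
    return (sample_mean - dist_mean) / math.sqrt(variance / npp)

# A trigram seen once, whose words each occur only once, in ~1.2M
# trigrams scores about 1.0 under this formulation:
print(round(t_score_trigram(1, 1, 1, 1, 1206149), 4))
```

Only the distribution mean changes relative to the bigram case (a product of three marginal probabilities instead of two), which is what makes the extension straightforward.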
The following experiments were carried out:
TOP 50 RANK

Example run of the program:

count.pl ngram 3 CORPUS_3N.cnt CORPUS_3N
statistics.pl ngram 3 user3 CORPUS_3N.user3 CORPUS_3N.cnt
top 26 entries shown for each test
user3

1206149
_ma<>chere<>amie_<>1 1206149.0000 1 1 1 1 1 1 1
YZ<>MN<>OP<>1 1206149.0000 1 1 1 1 1 1 1
DEN<>WILD<>ZEE<>1 1206149.0000 1 1 1 1 1 1 1
saute<>aux<>champignons<>1 1206149.0000 1 1 1 1 1 1 1
ai<>trop<>dit<>1 1206149.0000 1 1 1 1 1 1 1
Cuesta<>del<>Burro<>1 1206149.0000 1 1 1 1 1 1 1
se<>laisser<>aller<>1 1206149.0000 1 1 1 1 1 1 1
vais<>seux<>cella<>1 1206149.0000 1 1 1 1 1 1 1
EXTRA<>PRIVATE<>LESSONS<>1 1206149.0000 1 1 1 1 1 1 1
WX<>GH<>IJ<>1 1206149.0000 1 1 1 1 1 1 1
WHICH<>CHRISTIAN<>MEETS<>1 1206149.0000 1 1 1 1 1 1 1
SPIRITUAL<>INTERFERENCE<>Struck<>1 1206149.0000 1 1 1 1 1 1 1
cauld<>kail<>hae<>1 1206149.0000 1 1 1 1 1 1 1
MN<>OP<>QR<>1 1206149.0000 1 1 1 1 1 1 1
COINCIDENCES<>SUGGESTING<>SPIRITUAL<>1 1206149.0000 1 1 1 1 1 1 1
Todos<>los<>Santos<>1 1206149.0000 1 1 1 1 1 1 1
AB<>CD<>EF<>1 1206149.0000 1 1 1 1 1 1 1
SUGGESTING<>SPIRITUAL<>INTERFERENCE<>1 1206149.0000 1 1 1 1 1 1 1
DAN<>BAXTER<>GIVES<>1 1206149.0000 1 1 1 1 1 1 1
filet<>saute<>aux<>1 1206149.0000 1 1 1 1 1 1 1
YE<>GENTLE<>READER<>1 1206149.0000 1 1 1 1 1 1 1
STEVE<>OFFERS<>CONGRATULATIONS<>1 1206149.0000 1 1 1 1 1 1 1
CD<>EF<>ST<>1 1206149.0000 1 1 1 1 1 1 1
_la<>personne<>indiquee_<>1 1206149.0000 1 1 1 1 1 1 1
BAXTER<>GIVES<>AID<>1 1206149.0000 1 1 1 1 1 1 1
Intellectual<>Peach<>Parer<>1 1206149.0000 1 1 1 1 1 1 1
ll3

1206149
.<>,<>the<>1 7786623.9583 2 53260 79083 52108 133 1502 2230
.<>.<>,<>2 7751902.6629 89 53260 53260 79083 1603 4088 133
,<>of<>,<>3 7715136.9923 2 79083 26941 79083 505 5551 49
,<>I<>,<>4 7394871.8862 3 79083 22711 79083 3598 5551 196
,<>and<>,<>5 7341499.8465 333 79083 31336 79083 14223 5551 651
,<>in<>,<>6 7192382.4111 1 79083 15673 79083 1132 5551 235
,<>was<>,<>7 7079014.9195 6 79083 12419 79083 500 5551 249
,<>that<>,<>8 7026236.7496 45 79083 12135 79083 1212 5551 472
,<>,<>one<>9 6962526.4875 1 79083 79083 3502 2 248 198
,<>you<>,<>10 6904810.7321 5 79083 9087 79083 858 5551 605
,<>had<>,<>11 6904386.0245 4 79083 7956 79083 277 5551 83
two<>,<>,<>12 6891919.9773 1 1611 79083 79083 56 147 2
,<>he<>,<>13 6880522.0448 1 79083 8746 79083 1592 5551 153
,<>for<>,<>14 6873635.0312 16 79083 7959 79083 990 5551 96
,<>s<>,<>15 6859784.9031 1 79083 6644 79083 1 5551 105
,<>as<>,<>16 6843698.9151 9 79083 8024 79083 1899 5551 30
square<>,<>,<>17 6837502.3425 1 174 79083 79083 23 11 2
,<>,<>V<>18 6834679.2425 1 79083 79083 79 2 4 1
,<>is<>,<>19 6830741.7429 1 79083 6389 79083 270 5551 222
,<>my<>,<>20 6823826.3993 3 79083 6056 79083 330 5551 9
and<>,<>the<>21 6821418.3294 5 31336 79083 52108 651 1409 2230
,<>be<>,<>22 6788346.9271 1 79083 5008 79083 43 5551 74
,<>of<>.<>23 6773018.1244 1 79083 26941 53260 505 1716 47
,<>she<>,<>24 6769209.9658 1 79083 5660 79083 1158 5551 86
,<>me<>,<>25 6762351.7502 3 79083 5225 79083 25 5551 935
,<>have<>,<>26 6762303.2148 2 79083 4391 79083 59 5551 45
Here we have a situation similar to the bigram case. The ll3 measure produces top results with
meaningless collocations that are merely frequent trigrams. The user3 test, on the other hand,
produces top results consisting of alphabetic tokens, but the majority of them, I believe, are not
meaningful collocations and just happened to occur next to each other. However, there were some
good examples of collocations, such as "Electromotive<>Force<>Resistance<>", which describes a
certain force. Here I would say that the user3 test performed better with regard to the top 50
rankings.
CUTOFF POINT

The cutoff points for these two measures differ tremendously; even the rankings are vastly
different. Consider the following trigram and its ranks: ll3 ranks it at 713651, while user3 ranks
it 1st, a tremendous difference.
Intellectual<>Peach<>Parer<>713651 -- ll3
Intellectual<>Peach<>Parer<>1 -- user3
For ll3 we start to see trigrams made up of alphanumeric tokens halfway through the rankings; by
contrast, in the user3 ranking table we see this phenomenon from the top of the table. I believe
this is because the log likelihood measure considers both the expected and observed frequencies,
while the t-test considers the sample mean and variance together with the distribution mean, i.e.
the probability, under the NULL hypothesis (H0), of drawing an element at random with that mean and
variance. In trigrams the marginal tables are very big because of the large number of combinations,
and this influences the ll3 result. I could not find a clear cutoff point for the ll3 rankings:
good collocations seemed to be mixed in everywhere in the table, with no clear breaking point.
OVERALL RECOMMENDATION

In this experiment, too, I recommend the user3 measure (t-test), because it identifies collocations
better than the ll3 measure.
REFERENCES:

- Foundations of Statistical Natural Language Processing - Manning and Schutze
- The Analysis of Cross-Classified Categorical Data - Fienberg
- Measures of Association - Albert M. Liebetrau
- Statistical Techniques for Data Analysis - John Keenan Taylor
- Multiway Contingency Tables Analysis for the Social Sciences - Thomas D. Wickens
Name : Nitin Agarwal    Date : 10-Oct-02
Class : Natural Language Processing, CS8761
=========================================================
Objective
To find a measure of association that holds good for bigrams as well as trigrams, implement it
over a corpus, and then finally compare this measure with some standard measures.
Procedure
First, the Ngram Statistics Package (NSP) version 0.51 needs to be downloaded and installed. Once
this has been done, a measure of association must be found that can be used with both 2- and 3-word
sequences. We then implement this measure over a large corpus to identify the 2- and 3-word
collocations in it. The implementation consists of four .pm files, namely tmi.pm, user2.pm, ll3.pm
and user3.pm.
Corpus
The corpus used for this assignment is corpus.txt.
Measure of association
The t-test has been used. The formula for the t-score is given below:
t-score = (O11 - E11) / sqrt(O11)
where O11 is the observed joint frequency
and E11 is the expected joint frequency.
The measure of association was found on
http://www.collocations.de/EK/amhtml/
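A minimal sketch of this formula, using hypothetical counts for illustration (not taken from corpus.txt, and not the actual module code):

```python
import math

def expected(n1p, np1, npp):
    """E11: expected joint frequency of a bigram under independence,
    computed from the marginal counts n1p, np1 and the total npp."""
    return n1p * np1 / npp

def t_score(o11, e11):
    """t-score = (O11 - E11) / sqrt(O11), as defined above."""
    return (o11 - e11) / math.sqrt(o11)

# Hypothetical counts: the bigram occurs 30 times, its words occur
# 100 and 120 times, in 10000 bigrams overall.
e = expected(100, 120, 10000)   # 1.2
print(round(t_score(30, e), 4))
```

Note that this count-based form approximates the variance of the observed count by the count itself, which is why only O11 appears under the square root.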
Conventions used
In this experiment, all output files have the extension .op2 or .op3, depending on whether they are
for bigrams or trigrams respectively. The output file name is the same as the name of the file
being processed.
Experiment 1

TOP 50 COMPARISON
The fiftieth bigram in tmi.op2 is ranked 29, whereas in user2.op2 this row is occupied by a
rank-49 bigram. Hence we observe that in the latter file all bigrams but one up to rank 50 have
unique scores, which is not the case with tmi.op2. From the output files it can be seen that,
compared to the top 50 collocations in tmi.op2, user2.op2 has more collocations with a high
frequency of occurring together.
CUTOFF POINT
In user2.op2, from line 45312 onward the scores go negative; all the bigrams above this line have
positive scores. I could not find any cutoff point for ll.op2. The possible reason is that the
scores calculated with this measure are quite high and hence never reach a negative value.
RANK COMPARISON
tmi.op2 Vs mi.op2

Rank correlation coefficient = 0.0874
As the value of the rank correlation coefficient is just above 0, the two files are almost
unrelated and show only a very small amount of relatedness. One important point worth noticing is
that in mi.op2 all of the first 121 bigrams share rank 1, whereas in tmi.op2 most of the bigrams
at the top of the list have unique scores. As we move down the list, some bigrams begin to share
ranks, and the number of such bigrams seems to increase at an almost steady rate. Looking at the
3 rightmost columns, they all have value 1 for the first-ranked bigrams in mi.op2, whereas in
tmi.op2 the value for the number of occurrences (third column of numbers) does not seem to follow
any progression. This shows that mi.op2 lists all the collocations that occur once at the
beginning. One point I noticed, which may not be very relevant, is that in tmi.op2 the score of the
highest ranked bigram is 0.0387, which is close to the inverse of the score of the highest ranked
bigram of mi.op2, 17.0977.
output of tmi.pm
,<>and<>1 0.0387 2634 11467 4775
I<>had<>2 0.0174 810 5158 1546
;<>and<>3 0.0149 844 2330 4775
of<>the<>4 0.0086 817 3558 5854
output of mi.pm
SEND<>MONEY<>1 17.0977 1 1 1
isle<>Fernando<>1 17.0977 1 1 1
COUP<>DE<>1 17.0977 1 1 1
repenting<>prodigal<>1 17.0977 1 1 1
tmi.op2 Vs ll.op2

Rank correlation coefficient = 0.1916
In this comparison, too, the rank correlation coefficient is positive but very low. Hence we infer
that the two output files again have only a small degree of relatedness, although this time more
than in the previous comparison. Again, for ll.op2 all the collocations at the top of the list
occur only once. In tmi.op2 the highest score is 0.0387 and the lowest is 0, whereas in ll.op2 the
highest score is 7516, which again goes all the way down to 0. Hence the gradient of the scores in
ll.op2 is much steeper than in tmi.op2.
output of ll.pm
,<>and<>1 7516.6877 2634 11467 4775
I<>had<>2 3387.5897 810 5158 1546
;<>and<>3 2893.9474 844 2330 4775
;<>but<>4 1675.1604 349 2330 958
user2.op2 Vs mi.op2

Rank correlation coefficient=0.5824
The coefficient of 0.5824 shows that the two files are neither completely unrelated nor a perfect
match; they are somewhere in between. As the value shows, user2.op2 is closer to mi.op2 than
tmi.op2 is. One similarity between the two files is that towards the end the scores attain
negative values in both. Again, in user2.op2 all the scores at the beginning are unique, which is
not the case with mi.op2.
output of user2.pm
,<>and<>1 43.7160 2634 11467 4775
I<>had<>2 26.4628 810 5158 1546
;<>and<>3 26.3213 844 2330 4775
of<>the<>4 23.3878 817 3558 5854
user2.op2 Vs ll.op2

Rank correlation coefficient=0.8393
As seen from the coefficient for these two files, they are nearly a perfect match, and most rows
in the two files contain nearly the same collocations. The only major difference is the value of
the scores for the individual bigrams, which are far apart at the beginning, get closer towards
the middle, and start to diverge again at the end. Once more, user2.op2 is closest to ll.op2, as
is clear from the rank correlation coefficient.
OVERALL RECOMMENDATION
I would suggest ll.pm as a good measure for identifying significant collocations. As discussed
above, user2.op2 and ll.op2 have unique ranks for most of the bigrams. The total number of unique
ranks for user2.op2 is 12767, whereas for ll.op2 it is 19482. Since ll.op2 has more unique ranks
than user2.op2, it discriminates between bigrams more finely and is therefore, in my view, the
better measure of association.
Experiment 2

TOP 50 RANK
In ll3.op3, most of the top 50 elements share rank 1, whereas in user3.op3 most of the top 50
entries have distinct ranks. ll3.op3 lists all the trigrams that occur once at the bottom, whereas
user3.op3 lists first a trigram that occurs 6 times; this count then decreases before increasing
again towards the end of the file.
output of user3.pm
by<>Daniel<>Defoe<>1 2.3877 6 590 6 6 6 6 6
PROJECT<>GUTENBERG<>tm<>2 2.2353 5 7 7 5 7 5 5
Small<>Print<>!<>3 2.2287 5 5 5 92 5 5 5
this<>Small<>Print<>4 2.1764 5 749 5 5 5 5 5
Illinois<>Benedictine<>College<>5 1.9998 4 4 4 4 4 4 4
output of ll3.pm
data<>,<>transcription<>1 392089.1000 1 1 11467 1 1 1 1
,<>shame<>opposed<>1 392089.1000 1 11467 1 1 1 1 1
los<>Santos<>,<>1 392089.1000 1 1 1 11467 1 1 1
computerized<>population<>,<>1 392089.1000 1 1 1 11467 1 1 1
temperance<>,<>moderation<>1 392089.1000 1 1 11467 1 1 1 1
CUTOFF POINT
In user3.op3, from line 8634 onward the scores go negative; all the trigrams above this line have
positive scores. In the case of ll3.op3 there does not seem to be any cutoff value.
OVERALL RECOMMENDATION
For trigrams, I would suggest user3.op3 for the same
reasons that were given for bigrams.
Kailash Aurangabadkar
CS8761
Assignment #2

The objective of the assignment is to investigate various measures of association that can be used
to identify collocations in a large corpus (CORPUS) of text.
The assignment is essentially to find interesting co-occurrences of two or more words, i.e. words
whose co-occurrence is not coincidental. Such words group to form a collocation.

Collocations:
A collocation is a sequence of two or more consecutive words, that has
characteristics of a syntactic and semantic unit, and whose exact and
unambiguous meaning or connotation cannot be derived directly from the meaning
or connotation of its components.
Examples of Collocations:
Collocations include noun phrases like "strong tea" and "weapons of mass
destruction", phrasal verbs like "to make up", and other stock phrases like the
"rich and powerful", "a stiff breeze" but not a "stiff wind", "broad daylight" etc.
Criteria for Collocations
Typical criteria for collocations are non-compositionality, non-substitutability and
non-modifiability. Collocations cannot be translated into other languages word by word. A phrase
can be a collocation even if it is not consecutive (as in the example knock . . . door).
Compositionality
A phrase is compositional if its meaning can be predicted from the meaning of its parts.
Collocations are not fully compositional in that there is usually an element of meaning added to
the combination, e.g. "strong tea".
Idioms are the most extreme examples of non-compositionality, e.g. "to hear it through the
grapevine".
Non-Substitutability
We cannot substitute near-synonyms for the components of a collocation.
Non-Modifiability
Many collocations cannot be freely modified with additional lexical material or through
grammatical transformations.

Process:
The assignment is an addition to the NSP package, developed by Satanjeev Banerjee. The
implementation consists of four .pm files that will be used by the statistics.pl program that
comes with NSP; these .pm files will only run when used with NSP.

The assignment consists of two experiments:
Experiment 1:
To implement (true) mutual information and a test of association, not already implemented in
NSP, for bigrams. The association tests already implemented in NSP are: Left Fisher, Chi-Squared,
Dice, Pointwise Mutual Information, and Log Likelihood.
I have selected to implement the u-test for collocations; the implementation is in user2.pm. I
have also implemented the Student t-test for association, in tscore.pm. After implementing the
three tests, I executed them on bigrams collected from CORPUS, which was obtained by concatenating
all the text files in BROWN1. The null hypothesis for both the u-test and the t-test is that the
two words are independent.
Top 50 comparisons:
I executed user2.pm, tmi.pm and tscore.pm on a corpus of text consisting of 1336151 bigrams, the
concatenation of all text files in /home/cs/tpederse/CS8761/BROWN1. This text is available
at /home/cs/aura0011/CS8761/nsp/CORPUS .
User 2: We find the following significant collocations in the top 50
ranks when executed on CORPUS:
Breezy clotheslines, Peace Corps, Chemical Name, Southern California, final
solution, policy trends, Agatha Christie, Hong Kong, Rhode Island, export
import, testimony whereof, Electronic Circuits, Polypropylene glycol, barbarian
hordes, output control, polio vaccine, freight car, armed force etc.
It is observed that the u-test gives significant collocations, but also a lot of insignificant
ones with them. The u-test depends on the independent and joint frequencies of the two words: if
the two words always occur together and each occurs only rarely in any other construct, the u-test
recognizes them as good collocation candidates. But a collocation made up of two more frequent
words, like "bitten" and "down", is ranked low, as the two words making up the collocation also
occur without each other quite often.
These are some of the entries in the output file created by user2.pm:
1336151
.....
Rosy<>Fingered<>1 1155.9191 1 1 1
dabhumaksanigalu<>ahai<>1 1155.9191 1 1 1
Alcohol<>ingestion<>1 1155.9191 1 1 1
URETHANE<>FOAMS<>1 1155.9191 1 1 1
breezy<>clotheslines<>1 1155.9191 1 1 1
.....
MICROMETEORITE<>FLUX<>26 943.8026 2 2 3
Unifil<>loom<>27 943.8005 4 4 6
PEACE<>CORPS<>27 943.8005 4 6 4
JUNE<>18_<>27 943.8005 4 6 4
......
DEAE<>cellulose<>30 909.4587 13 13 21
Ham<>Richert<>31 895.3684 3 3 5
Final<>Solution<>31 895.3684 3 5 3
Sheraton<>Biltmore<>31 895.3684 3 5 3
Pro<>Arte<>31 895.3684 3 3 5
T-score: There are no really significant collocations in the t-score output, as it is based on raw
frequency of occurrence. Some entries are:
1336151
of<>the<>1 78.1686 9181 36043 62690
.<>The<>2 66.7814 4961 49673 6921
in<>the<>3 60.4230 5330 19581 62690
,<>and<>4 57.7736 5530 58974 27952
.<>He<>5 46.9438 2421 49673 2991
on<>the<>6 40.4487 2196 6405 62690
.<>It<>7 39.4899 1718 49673 2184
to<>be<>8 37.3632 1631 25725 6340
,<>but<>9 37.3273 1648 58974 3006
to<>the<>10 36.0291 3266 25725 62690
.<>In<>11 34.5555 1320 49673 1736
.<>I<>12 34.0372 1565 49673 5877
at<>the<>13 31.9296 1448 4966 62690
.<>But<>14 31.8652 1115 49673 1371
for<>the<>15 30.4946 1655 8833 62690
from<>the<>16 29.6802 1243 4190 62690
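The t-score formula from the introduction can be checked against the first entry above (of<>the, joint frequency 9181, marginals 36043 and 62690, in 1336151 bigrams). A minimal sketch in Python (NSP modules are Perl; this is only a cross-check of the arithmetic), assuming the sample variance is approximated by the sample mean:

```python
import math

def tscore(nij, ni, nj, total):
    """t = (sample_mean - distribution_mean) / sqrt(sample_variance / sample_size).

    Each bigram slot is treated as a Bernoulli trial: the sample mean is
    nij/total, the mean under H(0) is (ni/total)*(nj/total), and since the
    probability p is tiny, the variance p*(1-p) is approximated by p itself.
    """
    sample_mean = nij / total
    h0_mean = (ni / total) * (nj / total)
    return (sample_mean - h0_mean) / math.sqrt(sample_mean / total)

# of<>the from the listing above: 9181 36043 62690 in 1336151 bigrams
print(round(tscore(9181, 36043, 62690, 1336151), 4))  # close to the 78.1686 above
```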
Mutual Information: We find the following significant
collocations in the top 50:
United States, Rhode Islands, Peace corps, Bang Jensen, Air Force, Fiscal year,
nineteenth century, President Kennedy, United Nations, Civil War, New Orleans,
The Editor, Kansas city, Nuclear Weapons, High School, Vice President, York
city, Export Import, Supreme court, St. Louis etc.
We find that mutual information is good at finding
collocations that are made of proper nouns. Some entries in the output are as follows:
1336151
....
on<>the<>10 0.0031 2196 6405 62690
United<>States<>11 0.0029 346 456 447
had<>been<>12 0.0028 721 5096 2470
....
years<>ago<>32 0.0008 126 948 246
Rhode<>Island<>32 0.0008 77 89 129
....
ominant<>stress<>38 0.0002 28 62 108
There<>were<>38 0.0002 74 901 3279
Export<>Import<>38 0.0002 14 15 15
be<>taken<>38 0.0002 52 6340 278
.....
he<>asked<>38 0.0002 57 6799 392
nuclear<>weapons<>38 0.0002 20 110 61
But<>it<>38 0.0002 90 1371 6873
Cutoff Point:
User 2: Scanning through the output of the user2.pm module, I
find the cutoff point approximately where the u-value is around 12. Below a
u-value of 12, the collocations given are bigrams made up of common words.
T-score: The t-score gives a scattered distribution of collocations,
with the more frequently occurring bigrams at the top,
neglecting the lower-frequency, but important, ones. Thus a cutoff value
cannot be suggested for the t-score.
Mutual Information: As mutual information depends on observed and
expected frequencies, a cutoff is difficult to suggest.
Rank Comparison:
"True" Mutual Information: When rank.pl is executed with the output of "true"
mutual information and the output of pointwise mutual information, the value
returned is -0.96, which suggests that mutual information and pointwise mutual
information are not the same measure, but in fact rank in opposite directions
as the corpus size increases. When I tried them on a smaller corpus of 700000
bigrams, rank.pl returned a value of -0.7.
When rank.pl is executed with the output of "true" mutual information and the
output of the log likelihood ratio measure, the value returned is -0.96, which
suggests that mutual information and log likelihood are also not similar. In
fact, these two also diverge further as the corpus size increases.
Mutual information tends to remove the discrepancies of the likelihood ratio
measure and of pointwise mutual information, so it is a measure unlike either
of them.
User 2: When rank.pl is executed with the values returned by the u-test and
pointwise mutual information, we find that the value turns out to be 0.9641,
while that of the u-test with the log likelihood ratio is 0.93.
The measures pointwise mutual information, log likelihood ratio and the u-test
are based on calculations over the observed and expected values of the bigram,
while "true" mutual information is actually a calculation over all the values
in the contingency table and tries to analyze the corpus as a whole.
Thus it is different from the other three.
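rank.pl computes a rank correlation, so the sign behavior discussed above can be illustrated with Spearman's coefficient, rho = 1 - 6*sum(d^2)/(n*(n^2-1)). A small sketch with made-up five-item ranking lists (not the actual rank.pl code):

```python
def spearman(ranks_a, ranks_b):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

identical = [1, 2, 3, 4, 5]
reversed_ = [5, 4, 3, 2, 1]
print(spearman(identical, identical))  # 1.0: two measures rank the bigrams identically
print(spearman(identical, reversed_))  # -1.0: the measures rank in opposite order
```

A value near -1 therefore means the two measures order the bigrams in nearly opposite ways, not that they are unrelated.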
Overall Recommendation:
User2 :
This test is also based on the observed and expected values, but it
depends more on the difference between the joint frequency and the
independent frequencies. Thus it ranks less frequently occurring bigrams higher
than those occurring more frequently. The observations from these
investigations also suggest that the highly ranked bigrams are those whose
joint and independent frequencies do not differ much. This test gives a
relatively good measure of collocations, as can be seen from the output.
TScore:
In cases where n and c are not very frequent (and most words are
infrequent), and where the corpus is large, f(n)f(c)/N will be a very
small number. In such cases, subtracting this number from f(n,c) makes
only a small difference. It follows that
T ~ f(n,c) / sqrt(f(n,c)) = sqrt(f(n,c))
Therefore the main factor in the value of T is simply the absolute frequency
of joint occurrences. The t-value picks out cases where there are many joint
occurrences, and therefore provides confidence that the association between n
and c is genuine. But, clearly, whether we rank things by raw frequency
or by its square root makes no difference.
However, T is sensitive to an increase in the product f(n)f(c). In such cases
T = [f(n,c) - X] / sqrt(f(n,c)),
where X is significantly large relative to f(n,c). Since T decreases if
f(n)f(c) becomes very large, the formula has a built-in correction for cases
involving very common words. In practice, this correction has a large effect
only with a small number of common grammatical words, especially if they are in
combination with a second relatively common word. If the corpus gets larger,
but f(n)f(c) stays the same, then f(n)f(c)/N decreases again, and T
correspondingly increases. Thus T is larger when we have looked at a larger
corpus, and we can be correspondingly more confident of our results. Again,
this effect is noticeable only in cases where the node and/or collocate are
frequent. But the case of frequent collocates is also the drawback of the
test: it does not give importance to infrequent collocations.
Mutual Information:
The mutual information score expresses the extent
to which the observed frequency of co-occurrence differs from the expected
frequency (where expected means "expected under the null hypothesis").
But there is a problem with mutual information. Suppose a word appears
just once in the corpus, and that single occurrence is in a bigram with a
second word (which is not an unreasonable event). When we carry out
the sums to calculate the expected value of that bigram, and the corpus is
large, we get a very small expected value (about 1/1000). So even though the
first word occurs just once with the second word, the observed frequency
is 1000 times the expected joint frequency, and the mutual
information value will be high.
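The inflation described above can be made concrete with pointwise mutual information, PMI = log2(observed/expected). Using the corpus size from the outputs above (1336151 bigrams), a hapax pair whose words each occur exactly once scores far higher than a genuinely frequent pair such as of<>the (9181 36043 62690). A sketch, assuming the standard pointwise MI formula:

```python
import math

def pmi(nij, ni, nj, total):
    """Pointwise MI: log2(observed joint frequency / expected joint frequency)."""
    expected = ni * nj / total  # expected count under the independence hypothesis
    return math.log2(nij / expected)

# A bigram seen once, whose words each occur only once in 1336151 bigrams:
print(round(pmi(1, 1, 1, 1336151), 2))             # ~20.35: huge score from one event
# of<>the, seen 9181 times (marginals 36043 and 62690):
print(round(pmi(9181, 36043, 62690, 1336151), 2))  # ~2.44: far lower despite the evidence
```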
Also, when I experimented with the count module using a window size of 3 for
bigrams, I found some interesting bigrams like "rescue prisoners". These were
not present in the output with the default window size, because some two-word
collocations have some other constant number of words between them, like the
one mentioned (rescue prisoners). It is not present in the output in the first
case because a phrase like "rescue the prisoners" might have occurred instead.
This experiment was executed on a smaller corpus, as the number of bigrams
increases rapidly with window size.

Experiment 2:
To implement a module named ll3.pm that performs the log likelihood
ratio for 3-word sequences, and to create the 3-word version of user2.pm and
name it user3.pm.
A 3-word version of the u-test is implemented in user3.pm. For both of these,
the null hypothesis is extended from the two-word case to three words, i.e.
all three words are independent:
H(0): P(W1 W2 W3) = P(W1)*P(W2)*P(W3)
Top 50 Rank:
User3: We find the following interesting collocations in the top 50
ranked trigrams:
Getting along with, Rural Road Authority, Plant feeding facilities, Sound Tax
policy, United Nations Day, long term approach, Deal with principles, state
automobile practices, on unpaid taxes, peace corps volunteers, middle Atlantic
states etc.
It gives good candidates to be considered as collocations, but it is not
accurate and tends to rank less frequent word combinations higher
than more frequent ones. Some of the entries are:
1336150
_FRANKFURTER<>TWISTS_<>Blend<>1 1336150.0000 1 1 1 1 1 1 1
DEFINE<>INPUT<>OUTPUT<>1 1336150.0000 1 1 1 1 1 1 1
....
BED<>PRESSED<>OR<>38 243946.4984 1 2 1 15 1 1 1
PUSH<>UPS<>BUT<>38 243946.4984 1 2 3 5 2 1 1
PLANT<>FEEDING<>FACILITIES<>38 243946.4984 1 3 2 5 1 1 1
OR<>SODIUM<>HEXAMETAPHOSPHATE<>38 243946.4984 1 15 2 1 1 1 1
....
Dried<>rumen<>bacteria<>41 236200.1814 1 2 2 8 1 1 1
broiled<>steaks<>tantalizing<>41 236200.1814 1 2 4 4 1 1 1
GETTING<>ALONG<>WITH<>41 236200.1814 1 2 1 16 1 1 1
.....
INORS<>MUST<>ALSO<>41 236200.1814 1 2 8 2 1 1 1
0895<>NATURAL<>GAS<>41 236200.1814 1 4 4 2 1 1 1
SOUND<>TAX<>POLICY<>41 236200.1814 1 1 8 4 1 1 1
Accepted<>crystallographic<>symbolism<>41 236200.1814 1 2 2 8 1 1 1
Log likelihood 3: There are not many interesting trigrams in the top 50
rankings of the ll3.pm output. The following are the first 15 entries in the output:
1336150
,<>.<>The<>1 27511.2605 7 58974 49673 6921 57 30 4961
of<>.<>The<>2 25527.1297 2 36043 49673 6921 25 2 4961
a<>.<>The<>3 24045.6938 1 21998 49673 6921 15 2 4961
to<>.<>The<>4 24041.1144 3 25725 49673 6921 47 5 4961
.<>The<>.<>5 24006.2920 1 49673 6921 49674 4961 335 1
in<>.<>The<>6 23023.9802 11 19581 49673 6921 81 13 4961
is<>.<>The<>7 22535.4368 2 9969 49673 6921 29 3 4961
he<>.<>The<>8 22493.5350 1 6799 49673 6921 3 3 4961
for<>.<>The<>9 22438.9843 4 8833 49673 6921 28 4 4961
was<>.<>The<>10 22413.8054 1 9770 49673 6921 37 5 4961
with<>.<>The<>11 22337.1213 2 7007 49673 6921 19 2 4961
his<>.<>The<>12 22239.1767 3 6469 49673 6921 21 5 4961
at<>.<>The<>13 22215.6165 1 4966 49673 6921 10 1 4961
from<>.<>The<>14 22191.3867 1 4190 49673 6921 5 1 4961
had<>.<>The<>15 22155.1802 2 5096 49673 6921 17 2 4961
Cutoff point:
It is difficult to spot collocations for trigrams, as there are more
possible word combinations, so it is not easy to find the cutoff point.
User3:
As we skim through the output of user3.pm, we find that the
cutoff point occurs somewhere in the range where the u-test value is
greater than 2000 for the current CORPUS. The entries having a u-value above
2000 are composed of less frequent words. The entries below a value of 2000
contain more frequent words, like THE or IS, or a punctuation token, and hence
are weak candidates for collocations.
Log likelihood ratio 3:
It gives a rather scattered distribution of collocations, and
hence a cutoff cannot be suggested.
Rank Comparison:
When rank.pl is executed on the output of the log likelihood ratio
and the output of user3, we get a value of 0.4726. This suggests that the two
tests are not closely related to each other.
Overall Recommendations:
The u-test for collocations gives a good result for finding
collocations that do not occur frequently. To find the infrequent
word-combination collocations, we have to go through the output carefully. By
an infrequent word-combination collocation we mean a collocation formed by
words which occur independently more often than they occur in the collocation.
The log likelihood ratio measure gives better results for trigrams that
occur more often than for those which occur less often. But the trigrams which
occur most often are made up of common articles (like "the"), verbs (like
"is"), or punctuation marks. Thus the highly ranked trigrams in the output of
ll3.pm are not significant collocations.
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
CS8761 Natural Language Processing
Assignment 2
Archana Bellamkonda
October 11, 2002.
Problem Description :

The objective is to investigate various measures of association that can be used to identify collocations in large
corpora of text. Following the investigation, a measure is selected and implemented so that it can be used with
2 and 3 word sequences to find the association between the words. This measure is also compared with some other
standard measures and subsequent analysis is done.
[ A measure of association is given by a value calculated from the selected method, and that value determines the chance
of the words in a bigram or trigram (or ngram in general) being together, i.e. we get information about the mutual
dependence between the words. ]
Input and Output Description :

Our aim is to calculate a score for each bigram or trigram that determines the association between the words.
In this experiment, we write modules for calculating values based on true mutual information (tmi.pm), the log-likelihood
ratio for trigrams (ll3.pm), and a new measure for bigrams (user2.pm) and trigrams (user3.pm).
The program statistic.pl in the NSP package calculates these values for each bigram or trigram; we have to specify the name
of the module we are using, a file to write the results into, and an input file in the format output by
count.pl in the NSP package. [count.pl finds all ngrams in a file and writes them into the specified file, along with
various frequency values associated with each ngram.] The output of count.pl is used as input for statistic.pl.
Example:

If abc.txt is the text file we want to analyse,
then we can run a simple count.pl with the following syntax:
perl count.pl result.txt abc.txt
The above syntax is for bigrams. If we need the same for trigrams, the NSP package allows us to write:
perl count.pl --ngram 3 result3.txt abc.txt
[Other ways to run are specified in the package. ]
Once we get result.txt or result3.txt, containing all bigrams or trigrams along with their frequency values, we use that
file as input to statistic.pl, which will calculate a score for each bigram or trigram and store the results in an output
file specified by the user. We should also specify the method to be used to calculate the scores. To calculate scores
using the method in module user2.pm, we can write the simple command:
perl statistic.pl user2 user2.res result.txt
For trigrams, we can run statistic.pl as
perl statistic.pl user3 --ngram 3 user3.res result3.txt
Thus, we can compute scores that tell about association between words in an ngram and get results in a file specified by user.
Description of Method Used :

user2.pm :

In user2.pm, I used the Odds Ratio as a measure of association.
[ Found in the book Categorical Data Analysis by ALAN AGRESTI, which I located by searching in the library ]
If X<>Y is a bigram, then the Odds Ratio for it is given by the formula,
Odds Ratio = ( frequency of X and Y occurring together * frequency of both X and Y not occurring together ) /
( frequency of X occurring and Y not occurring * frequency of Y occurring and X not occurring )
Note : Above, when we mention the occurrence of X or Y, we mean the occurrence of X in the first position and of Y
in the second position of the bigram.
The frequency of X and Y occurring together is in the numerator, along with the frequency of both X and Y not
occurring together (if this value increases, there is less chance of bigrams that contain X in the first position
without Y in the second, or vice versa). If these values increase, there is intuitively a stronger association
between the words, and hence the Odds Ratio increases. Similarly, if the frequency of X or Y occurring alone in
the correct position increases, it means that the frequency of X and Y not occurring together increases, and hence
the association is low. As these frequencies are in the denominator, the Odds Ratio decreases, and hence the
formula makes sense.
Thus, the Odds Ratio is calculated for each bigram, and we can judge the association between the two words in a
bigram from it. The greater the ratio, the greater the association between them.
Example :
Say i<>j is a bigram, and we have (first bigram in the output given by count.pl when run on Brown.txt):
nij = number of times i<>j occurred in the sample = 9181
ni+ = number of times i<>+ occurred in the sample (second word can be anything) = 36043
n+j = number of times +<>j occurred in the sample (first word can be anything) = 62690
total = total number of bigrams in the sample = 1336151
Now we get:
nibarj = number of times j occurred in the second place with i not in the first place
= n+j - nij = 62690 - 9181 = 53509
nijbar = number of times i occurred in the first place with j not in the second place
= ni+ - nij = 36043 - 9181 = 26862
nibarjbar = number of times neither i nor j occurred = total - ni+ - n+j + nij = 1336151 - 36043 - 62690 + 9181 = 1246599
Odds Ratio = nij * nibarjbar / (nijbar * nibarj)
= 9181 * 1246599 / (26862 * 53509) = 7.9625
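The worked example above can be reproduced mechanically. A sketch in Python (a cross-check of the arithmetic, not the user2.pm code itself):

```python
def odds_ratio(nij, ni_plus, n_plus_j, total):
    """Bigram odds ratio from the joint frequency and the two marginals."""
    nibarj = n_plus_j - nij                       # j in second place, i not in first
    nijbar = ni_plus - nij                        # i in first place, j not in second
    nibarjbar = total - ni_plus - n_plus_j + nij  # neither i nor j
    return nij * nibarjbar / (nijbar * nibarj)

# i<>j = of<>the: 9181 36043 62690 out of 1336151 bigrams
print(round(odds_ratio(9181, 36043, 62690, 1336151), 4))  # 7.9625, as above
```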
user3.pm :

The same concept of Odds Ratio can be extended to apply for trigrams.
Let X<>Y<>Z be a trigram.
Mantel and Haenszel proposed the formula for calculating Odds Ratio for trigrams to be
Odds Ratio = (for all k, (n11k * n22k / n++k) ) / (for all k, (n12k * n21k / n++k) )
[Page 236 in the book, Categorical Data Analysis]
where
n111 = frequency of occurrence of the trigram XYZ,
n112 = frequency of trigrams where X,Y are in the first and second positions and Z is not in the third position,
n121 = frequency of trigrams where X,Z are in the first and third positions and Y is not in the second position,
n122 = frequency of trigrams where X occurs in the first place and Y,Z are not in the second and third positions,
.
.
.
. and so on
n++1 = n111 + n121 + n211 + n221 and
n++2 = n112 + n122 + n212 + n222
and k = 1,2 for trigram.
The Odds Ratio makes sense for trigrams as it does for bigrams: an increase in the value of the numerator intuitively implies greater association.
Example :

total = 1071112
n111 = xyz = 2179
xpp = 35891
pyp = 35891
ppz = 35891
xyp = 3751
xpz = 2760
pyz = 3751
n121 = xybarz = xpz - xyz = 2760 - 2179 = 581
n211 = xbaryz = pyz - xyz = 3751 - 2179 = 1572
n112 = xyzbar = xyp - xyz = 3751 - 2179 = 1572
n221 = xbarybarz = ppz - n121 - n211 - xyz = 35891 - 581 - 1572 - 2179 = 31559
n122 = xybarzbar = xpp - n112 - n121 - xyz = 35891 - 1572 - 581 - 2179 = 31559
n212 = xbaryzbar = pyp - n112 - n211 - xyz = 35891 - 1572 - 1572 - 2179 = 30568
n222 = total - xpp - pyp - ppz + xyp + xpz + pyz - xyz = 1071112 - 35891 - 35891 - 35891 + 3751 + 2760 + 3751 - 2179 = 971522
npp1 = n111 + n121 + n211 + n221 = 2179 + 581 + 1572 + 31559 = 35891
npp2 = n112 + n122 + n212 + n222 = 1572 + 31559 + 30568 + 971522 = 1035221
odds ratio = (npp2*n111*n221 + npp1*n112*n222) / (npp2*n121*n211 + npp1*n122*n212)
= 3.5425
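The Mantel-Haenszel computation can be cross-checked in Python, deriving every cell by inclusion-exclusion from the marginal counts listed at the start of the example (a sketch, not the user3.pm code):

```python
def mantel_haenszel_or(total, xyz, xpp, pyp, ppz, xyp, xpz, pyz):
    """Mantel-Haenszel odds ratio for a trigram X<>Y<>Z; the eight cells
    are derived from the marginal counts by inclusion-exclusion."""
    n111 = xyz
    n121 = xpz - xyz                 # X and Z in place, Y absent
    n211 = pyz - xyz                 # Y and Z in place, X absent
    n112 = xyp - xyz                 # X and Y in place, Z absent
    n221 = ppz - n111 - n121 - n211  # Z in place, X and Y absent
    n122 = xpp - n111 - n112 - n121  # X in place, Y and Z absent
    n212 = pyp - n111 - n112 - n211  # Y in place, X and Z absent
    n222 = total - xpp - pyp - ppz + xyp + xpz + pyz - xyz
    npp1 = n111 + n121 + n211 + n221
    npp2 = n112 + n122 + n212 + n222
    num = n111 * n221 / npp1 + n112 * n222 / npp2
    den = n121 * n211 / npp1 + n122 * n212 / npp2
    return num / den

# marginals from the example above
print(round(mantel_haenszel_or(1071112, 2179, 35891, 35891, 35891,
                               3751, 2760, 3751), 4))  # ~3.5425
```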
EXPERIMENT 1 :

Once user2.pm is implemented, it is run with NSP to perform analysis on the following.
TOP 50 COMPARISION :

> Number of Ranks :

On observing the ranks produced by user2.pm and tmi.pm, and considering the top 50 ranks, it is observed that there are
only 40 ranks for the text I used when tmi.pm is applied. So I couldn't even observe 50 ranks, even though the text
consists of more than a million tokens. But there are 131972 ranks when the user2 module is used.
Analysis : The reason for getting a smaller number of ranks is that the score is zero for most of the bigrams. The score
being zero implies either that (observed frequency of the 2 words being together) / (expected value for the words being
together) is not positive, because only then is the value not calculated in the logic of the program [log is calculated
only for values greater than 0], or that the observed frequency equals the expected frequency [because log 1 = 0, and
that is the only other way to get zero]. There is no chance that the expected value is negative [since it is the product
of the left and right frequencies of a bigram divided by the total number of bigrams, and all these values, taken from
count.pl, are never negative]. So the observed frequency of the 2 words being together can be zero, but never negative;
and when the ratio is valid, the only way to get a zero score is for the observed frequency to equal the expected
frequency.
NOTE:

The formula used for tmi is:
score = sum over all i,j of ( (nij / n++) * log ( nij / eij ) ), where log is to base 2
and eij is the expected frequency under independence.
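The formula can be checked against the tmi.pm listing shown later: for .<>The<> (4961 49673 6921, N = 1336151) the per-bigram score should come out near the 0.0135 at rank 1. A sketch, assuming the sum runs over all four cells of the bigram's 2x2 contingency table:

```python
import math

def tmi(nij, ni_plus, n_plus_j, total):
    """True mutual information for one bigram: sum over the four cells of
    its 2x2 contingency table of (observed/N) * log2(observed/expected)."""
    observed = [
        nij,                               # i and j together
        ni_plus - nij,                     # i without j
        n_plus_j - nij,                    # j without i
        total - ni_plus - n_plus_j + nij,  # neither i nor j
    ]
    row = [ni_plus, total - ni_plus]
    col = [n_plus_j, total - n_plus_j]
    expected = [row[0] * col[0], row[0] * col[1],
                row[1] * col[0], row[1] * col[1]]
    expected = [e / total for e in expected]
    return sum((o / total) * math.log2(o / e)
               for o, e in zip(observed, expected) if o > 0)

# .<>The<> : 4961 49673 6921 in 1336151 bigrams -> ~0.0135
print(round(tmi(4961, 49673, 6921, 1336151), 4))
```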
Conclusion for this Analysis : Using tmi, we are observing that observed value is equal to expected in relatively more number of
cases which means that words in bigram are independent. Thus, Null Hypothesis can be accepted in this case.
> Comparision between user2 and tmi :

I observed that the bigrams ranked high using tmi are ranked low using user2. If we are considering a particular text and
performing analysis on it, I feel that user2 would be more interesting than tmi, as it gives wider rankings, so that we
can have better intuition regarding the words of that particular text. In my experiment, I observed that using user2,
bigrams that are names of some sort, or common phrases (at least in that text), are ranked high. Using tmi, I observed
that bigrams that are common in the general language, for example ".<>The<>" (ranked 1), and not particular to the text,
are ranked high.
Hence, I feel that user2 would be significantly better than tmi for analyzing texts of a particular author or category,
etc., but tmi would be better than user2 for performing analysis on a language rather than on a specific text.
CUTOFF POINT :

tmi.pm : For scores generated using the tmi module, 0.0135 is the score for the first rank in my experiment. The score
goes on decreasing as the rank increases, and at a certain stage it remains constant once it reaches zero. We can
neglect all the bigrams whose score is zero; this means that the words in those bigrams have no dependence or
association. Hence the cutoff point is the point where we encounter zero as the value of the measure.
user2.pm : For scores generated using the user2 module, 1336150 is the score for the first rank in my experiment. From
there the score goes on decreasing as the rank increases, but I didn't observe any cutoff point beyond which I can
neglect the values. I didn't find a cutoff because the scores are not normalized or bounded; when there are boundaries,
we can find cutoff values at the boundaries, but as we lack specific boundaries here, we are unable to determine the
cutoff point. This tells us that this test is not suitable for all texts. The reason is that as the size of the corpus
increases, the value of the odds ratio goes on increasing, which is not desirable. Especially in cases where a bigram
occurs only once in a very large corpus, the value of (frequency of occurrences without X and Y) in the numerator
increases considerably, and we don't have a way to normalize this. Thus we can say that the odds ratio test is suitable
only for texts of reasonable size with acceptable frequency values for the occurrences of bigrams. We can also conclude
that we didn't find a cutoff using user2.pm because the score varies with corpus size: when a score is a function of
size, a cutoff can generally be chosen as a function of size, and the cutoff changes with size. But here it is not clear
what the relation is between the size of the text and the odds ratio. Hence we are unable to determine a cutoff for this
experiment.
RANK COMPARISION :

Comparison with ll.pm :
The rank correlation coefficient on comparing ll.pm with tmi.pm is -0.9668
The rank correlation coefficient on comparing ll.pm with user2.pm is 0.8349
Comparison with mi.pm :
The rank correlation coefficient on comparing mi.pm with tmi.pm is -0.9668
The rank correlation coefficient on comparing mi.pm with user2.pm is 1.0000
Observing the rank correlation coefficients above, we can conclude that user2 gives exactly the same rankings as mi.pm,
which means there is a strong relation between mi.pm and user2.pm. Also, user2.pm is strongly related to ll.pm. Thus we
tend to infer that user2 is a good measure of association, as it correlates well with the existing measures. This holds
as long as the drawback observed when determining the cutoff point above is not significant.
> Related but Negative value :

One interesting observation is that comparing with tmi.pm gives us a negative coefficient. But if we observe the actual
data in the results of both modules, they have very similar rankings. For example, consider the first 10 rankings given
by ll.pm and tmi.pm :
ll.pm:

1336151
.<>The<>1 25043.4196 4961 49673 6921
of<>the<>2 18856.1038 9181 36043 62690
.<>He<>3 13183.4486 2421 49673 2991
in<>the<>4 11393.6840 5330 19581 62690
.<>It<>5 9139.3156 1718 49673 2184
,<>and<>6 9072.9347 5530 58974 27952
.<>In<>7 6844.0270 1320 49673 1736
,<>but<>8 6309.5959 1648 58974 3006
the<>the<>9 6126.7892 3 62690 62690
.<>But<>10 6064.5057 1115 49673 1371
tmi.pm :

1336151
.<>The<>1 0.0135 4961 49673 6921
of<>the<>2 0.0102 9181 36043 62690
.<>He<>3 0.0071 2421 49673 2991
in<>the<>4 0.0062 5330 19581 62690
.<>It<>5 0.0049 1718 49673 2184
,<>and<>5 0.0049 5530 58974 27952
.<>In<>6 0.0037 1320 49673 1736
,<>but<>7 0.0034 1648 58974 3006
.<>But<>8 0.0033 1115 49673 1371
the<>the<>8 0.0033 3 62690 62690
to<>be<>9 0.0032 1631 25725 6340
on<>the<>10 0.0031 2196 6405 62690
So it is clear that both measures should be strongly related, as the rankings are nearly the same. That means we expect
a value around +1, but we observed a value around -1 (observed value = -0.9668). This is an interesting observation. The
reason is that tmi is normalized, so we could observe a cutoff at 0, and all the bigrams thereafter can be neglected as
they all share the same rank (only 40 ranks are observed in this experiment). But ll and mi don't assign the same rank
to so many bigrams; moreover, their scores depend on the size of the corpus, and we have many rankings (for example,
106744 rankings are observed in ll). Since the difference in rankings is the criterion in calculating the correlation
coefficient, these differences produce a negative value.
Therefore, though these methods seem highly correlated, they actually end up strongly negatively correlated.
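As a cross-check on the ll.pm numbers quoted above, the score for a bigram can be recomputed from its frequency columns. A sketch assuming the usual G^2 statistic, 2 * sum(observed * ln(observed/expected)) over the four cells of the 2x2 table (for .<>The<> this lands close to the 25043.4196 at rank 1):

```python
import math

def log_likelihood(nij, ni_plus, n_plus_j, total):
    """G^2 = 2 * sum over the 2x2 table of observed * ln(observed / expected)."""
    observed = [
        nij,                               # i and j together
        ni_plus - nij,                     # i without j
        n_plus_j - nij,                    # j without i
        total - ni_plus - n_plus_j + nij,  # neither i nor j
    ]
    row = [ni_plus, total - ni_plus]
    col = [n_plus_j, total - n_plus_j]
    expected = [row[0] * col[0] / total, row[0] * col[1] / total,
                row[1] * col[0] / total, row[1] * col[1] / total]
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# .<>The<> : 4961 49673 6921 in 1336151 bigrams -> ~25043.42
print(round(log_likelihood(4961, 49673, 6921, 1336151), 2))
```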
OVERALL RECOMMENDATION :

Based on all the observations made, I think the best method among mi.pm, ll.pm, user2.pm and tmi.pm for identifying
significant collocations in a large corpus would be "mi.pm" or "ll.pm".
user2.pm might not be appropriate in cases where most bigrams don't occur more than once, or where the corpus is
very large, for the reasons explained earlier during the observations.
tmi.pm would also be less appropriate than mi.pm, as it quickly converges to the cutoff, which excludes the bigrams with
scores equal to the cutoff from consideration.
EXPERIMENT 2 :

Once user3.pm is implemented, the following are done.
TOP 50 COMPARISION :

On observing the top 50 trigrams after running NSP on a corpus using ll3.pm and user3.pm, I see that the two methods are
significantly different from each other, as the trigrams in the top 50 ranks using ll3.pm and those using user3.pm
differ substantially.
Also, it is observed that the score values are relatively high using user3 compared to the scores using ll3. This could
indicate that user3.pm is more affected by the size of the text.
CUTOFF POINT :

No cut off points are observed for either of the methods, ll3.pm or user3.pm, for reasons similar to the bigram case.
OVERALL RECOMMENDATION :

Between ll3.pm and user3.pm, the better method would be ll3.pm, as we get very high ratios from user3.pm as the corpus
size increases, and user3 is especially degraded when many trigrams occur only once, which causes the ratio to increase
considerably, as in the user2.pm case for bigrams.
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
#[experiments.txt]
# Assignment # 2 : CS8761 Corpus Based NLP
# Title : Report for "The Great Collocation Hunt"
# Author : Deodatta Bhoite (bhoi0001@d.umn.edu)
# Version : 1.0
# Date : 10-10-2002
#
Introduction

"You shall know a word by the company it keeps!"  J.R.Firth
The most trivial way of finding collocations in a text would be to
count the number of joint occurrences of two words. However, it is
seen that we can find more significant collocations by applying
various tests of association to the frequency data (contingency table)
of the corpus.
In this assignment we investigate various tests of association to
identify 2-word and 3-word collocations.
Experiment 1

Corpus for this experiment has been taken from Project Gutenberg and
includes the following files:
Alice in Wonderland (alice.txt)
Black Beauty (blackbeauty.txt)
Robinson Crusoe (crusoe1.txt)
Further Adventures of Robinson Crusoe (crusoe2.txt)
Dracula (dracula.txt)
King James New and Old Testament (bible.txt)
Vikram and the Vampire (vikram.txt)
The total token size of the corpus is around 1.37M tokens.
However, I remove the digits and punctuation from the data by
specifying a token file to count.pl.
The digits and punctuation are filtered with the following regular
expression.
[token.txt]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/[a-zA-Z]+/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We then find the bigrams and the frequency data related to them
with the following command:
% count.pl --token token.txt test2.cnt /home/cs/bhoi0001/CS8761/nsp/
The top 50 collocations based on frequency counts are as follows. We
observe that they are not significant, which is probably because they
do not take into consideration the independent occurrence of the words
forming the bigrams.
[test2.cnt]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354222
of<>the<>14731 48646 89427
in<>the<>6985 20407 89427
the<>LORD<>5964 89427 6653
and<>the<>5282 59758 89427
to<>the<>3830 30166 89427
all<>the<>2661 8710 89427
shall<>be<>2551 10449 10176
And<>the<>2249 13476 89427
I<>will<>2044 23430 4922
for<>the<>2028 12188 89427
unto<>the<>2020 8944 89427
out<>of<>1953 4357 48646
of<>Israel<>1701 48646 2578
and<>I<>1700 59758 23430
from<>the<>1674 5455 89427
that<>I<>1665 20250 23430
the<>king<>1662 89427 2695
said<>unto<>1643 6088 8944
on<>the<>1643 4949 89427
And<>he<>1637 13476 15266
with<>the<>1624 10543 89427
of<>his<>1594 48646 12329
I<>have<>1575 23430 6484
into<>the<>1565 3251 89427
to<>be<>1564 30166 10176
by<>the<>1520 4865 89427
and<>he<>1494 59758 15266
I<>had<>1445 23430 6896
that<>he<>1420 20250 15266
son<>of<>1418 2305 48646
children<>of<>1403 1936 48646
it<>was<>1400 12324 11852
the<>children<>1346 89427 1936
and<>they<>1337 59758 10570
the<>house<>1325 89427 2368
the<>son<>1322 89427 2305
the<>land<>1297 89427 1885
upon<>the<>1274 3952 89427
the<>people<>1266 89427 2470
I<>was<>1262 23430 11852
I<>am<>1255 23430 1430
that<>the<>1227 20250 89427
at<>the<>1178 4608 89427
of<>them<>1134 48646 9649
unto<>him<>1126 8944 9642
in<>a<>1120 20407 18937
him<>and<>1107 9642 59758
and<>his<>1101 59758 12329
and<>to<>1090 59758 30166
came<>to<>1080 3349 30166
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have implemented "true" mutual information and Poisson's collocation measure
for finding collocations in the corpus. A description of their implementations
can be found in the source code comments. We will see the results of the
association measures (AM) and their analysis in the following sections.
(a) Top 50 Comparison
        
The two word collocations for True mutual information are found as follows:
% statistic.pl tmi test2.tmi test2.cnt
There are only 44 ranks in the TMI output, so we show the top 50 bigrams as
they are ordered in the output.
test2.tmi //TMI output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354222
the<>LORD<>1 0.0152 5964 89427 6653
of<>the<>2 0.0143 14731 48646 89427
shall<>be<>3 0.0075 2551 10449 10176
in<>the<>4 0.0074 6985 20407 89427
the<>the<>5 0.0067 2 89427 89427
I<>will<>6 0.0054 2044 23430 4922
said<>unto<>7 0.0052 1643 6088 8944
thou<>shalt<>8 0.0051 1024 5017 1627
I<>am<>9 0.0049 1255 23430 1430
of<>Israel<>10 0.0043 1701 48646 2578
out<>of<>11 0.0039 1953 4357 48646
children<>of<>12 0.0038 1403 1936 48646
son<>of<>13 0.0034 1418 2305 48646
thou<>hast<>14 0.0033 670 5017 1054
Van<>Helsing<>15 0.0031 305 305 305
I<>have<>15 0.0031 1575 23430 6484
the<>king<>16 0.0030 1662 89427 2695
and<>and<>17 0.0029 2 59758 59758
according<>to<>18 0.0027 734 799 30166
the<>children<>18 0.0027 1346 89427 1936
And<>he<>18 0.0027 1637 13476 15266
it<>was<>19 0.0026 1400 12324 11852
I<>had<>19 0.0026 1445 23430 6896
the<>land<>19 0.0026 1297 89427 1885
all<>the<>20 0.0025 2661 8710 89427
unto<>him<>20 0.0025 1126 8944 9642
had<>been<>21 0.0024 627 6896 1579
to<>pass<>21 0.0024 719 30166 906
into<>the<>22 0.0023 1565 3251 89427
of<>and<>23 0.0022 36 48646 59758
Lord<>GOD<>23 0.0022 290 1169 300
the<>son<>23 0.0022 1322 89427 2305
they<>were<>23 0.0022 921 10570 5349
came<>to<>23 0.0022 1080 3349 30166
the<>to<>23 0.0022 1 89427 30166
the<>house<>23 0.0022 1325 89427 2368
the<>earth<>24 0.0020 910 89427 1142
I<>could<>24 0.0020 782 23430 1969
he<>said<>24 0.0020 1022 15266 6088
unto<>them<>25 0.0019 968 8944 9649
could<>not<>25 0.0019 609 1969 10847
the<>people<>25 0.0019 1266 89427 2470
of<>of<>25 0.0019 9 48646 48646
to<>be<>25 0.0019 1564 30166 10176
Thus<>saith<>25 0.0019 305 514 1262
began<>to<>26 0.0018 542 736 30166
house<>of<>26 0.0018 978 2368 48646
saith<>the<>26 0.0018 891 1262 89427
a<>little<>26 0.0018 588 18937 1189
don<>t<>26 0.0018 219 219 667
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The two-word collocations for Poisson's AM are found as follows:
% statistic.pl user2 test2.u2 test2.cnt
The top 50 ranks are as follows:
test2.u2 //Poisson's AM output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354222
the<>the<>1 5888.7006 2 89427 89427
and<>and<>2 2621.8905 2 59758 59758
the<>to<>3 1984.4361 1 89427 30166
of<>and<>4 1966.1513 36 48646 59758
of<>of<>5 1693.0572 9 48646 48646
the<>I<>6 1539.8723 1 89427 23430
I<>the<>7 1269.7665 68 23430 89427
a<>the<>8 1243.3868 1 18937 89427
to<>and<>9 1078.9410 63 30166 59758
to<>of<>10 1076.6269 1 30166 48646
of<>to<>11 1048.2659 6 48646 30166
the<>he<>12 973.1853 6 89427 15266
Project<>Gutenberg<>13 910.1206 113 163 113
he<>the<>14 904.4241 22 15266 89427
his<>the<>15 807.4520 1 12329 89427
that<>and<>16 803.8059 19 20250 59758
I<>and<>17 803.2322 61 23430 59758
I<>of<>18 798.3679 8 23430 48646
of<>I<>19 764.5523 16 48646 23430
years<>old<>20 761.5457 161 752 965
Holy<>Ghost<>21 739.8175 91 152 91
lifted<>up<>22 716.3992 155 196 3976
the<>not<>23 709.7152 1 89427 10847
on<>board<>24 696.6052 157 4949 192
meat<>offering<>25 681.2251 122 322 730
Madam<>Mina<>26 667.7936 86 92 205
thus<>saith<>27 666.3556 139 467 1262
to<>to<>28 654.2245 3 30166 30166
of<>in<>29 650.7003 18 48646 20407
they<>the<>30 643.4693 11 10570 89427
in<>of<>31 636.3871 22 20407 48646
o<>clock<>32 626.1584 75 95 97
sin<>offering<>33 614.3277 118 455 730
it<>the<>34 582.5441 67 12324 89427
place<>where<>35 582.2297 147 1317 1092
thy<>servant<>36 581.0534 165 4529 554
the<>all<>37 568.8163 1 89427 8710
high<>places<>38 563.4935 98 545 295
thine<>hand<>39 541.2238 145 923 1935
of<>he<>40 536.4604 2 48646 15266
rose<>up<>41 535.2729 123 205 3976
my<>lord<>42 531.6304 154 8795 286
at<>least<>43 525.9582 130 4608 254
pass<>when<>44 523.7818 167 906 4141
burnt<>offerings<>45 520.0814 86 388 271
Dr<>Seward<>46 513.3245 64 139 79
Mrs<>Harker<>47 511.0650 65 102 128
Mock<>Turtle<>48 509.2428 56 56 59
thou<>mayest<>49 497.1772 109 5017 117
their<>fathers<>50 489.2345 153 5767 561
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We find that the collocations in the Poisson's AM output are more
interesting than those in the TMI output, e.g. Holy Ghost, Mock Turtle,
Madam Mina, thy servant, my lord, high places.
However, the Poisson's AM top ranks have collocations like "the the"
and "and and". This is due to the high value of 'lambda' (the expected
bigram count under independence), which is E11.
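The exact scoring formula is not reproduced in this write-up, so as a sketch: one Poisson-based score that matches the values listed above takes lambda as the expected bigram count under independence and scores each bigram by the negative natural log of the Poisson probability of its observed joint frequency. A minimal Python sketch (the function name and argument order are my own, not user2.pm's):

```python
import math

def poisson_score(joint, n1, n2, total):
    """-ln of the Poisson probability of seeing the bigram 'joint' times,
    with lambda = n1 * n2 / total (expected count under independence)."""
    lam = n1 * n2 / total
    # -ln( e^-lam * lam^joint / joint! )
    return lam - joint * math.log(lam) + math.lgamma(joint + 1)

# "the<>the<>": joint = 2, both marginals 89427, N = 1354222 bigrams
print(round(poisson_score(2, 89427, 89427, 1354222), 4))  # 5888.7006
# "and<>and<>": joint = 2, both marginals 59758
print(round(poisson_score(2, 59758, 59758, 1354222), 4))  # 2621.8905
```

A huge lambda makes even a tiny observed count astronomically improbable under the Poisson model, which is why "the the" and "and and" float to the top.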
(b) Cutoff Point
----------------
The cutoff for the TMI output is where rank 38 starts (the 221st bigram).
There are only 44 ranks in the output, of which the first 221 bigrams cover
38 ranks (ranks change quickly), whereas the rest of the huge number of
bigrams share just 6 distinct ranks (ranks change slowly); i.e., there is
a clustering of scores.
pass<>when<>38 0.0006 167 906 4141
There is no particular cutoff point in the Poisson's AM output, but we could
call the 1st rank a cutoff point, essentially because its score is unusually
high.
(c) Rank Comparison
-------------------
The rank comparison matrix for the measures is shown below.

           TMI        USER2
LL      -0.9046      0.9773
MI      -0.9055      0.6924
We observe that the output of TMI is significantly different from (in fact,
opposite to) that of LL and MI. The Poisson measure is similar to
log-likelihood [Krenn 00]; hence the high rank coefficient is justified.
Poisson's AM and MI are strongly correlated, but not exactly the same.
We also observe that although the rank coefficient of TMI and LL is negative
(inversely related), the top 30 bigrams are almost the same.
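For context, rank.pl reports a rank correlation coefficient; assuming it is Spearman's rho, a minimal sketch of the no-ties formula (illustrative only; a real implementation such as NSP's must also handle tied ranks):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation over two equal-length rank lists,
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))   (no-ties form)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# identical rankings give +1, fully reversed rankings give -1
print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

A coefficient near -1 therefore means the two measures order the same bigrams in nearly reverse order, which is what the TMI column above shows.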
(d) Overall Recommendation
--------------------------
I found that Pointwise Mutual Information gives very good results.
Most collocations (in the natural-language sense, not statistically proven)
would probably occur only once, and occur together. Such usage should be
highly ranked. For example:
menace<>Monster<>1 20.3690 1 1 1
blending<>contradictions<>1 20.3690 1 1 1
fleeting<>diorama<>1 20.3690 1 1 1
Hence, I would recommend Pointwise Mutual Information despite its drawbacks.
My judgment is based on the results of my experiments.
Experiment 2
------------
The corpus used for the second experiment is the same as that used for the
first experiment. However, in this experiment we find the trigrams (rather,
3-word collocations) in the corpus.
The data is preprocessed using the same token file to remove the non-alpha
characters.
Note: Before you run the following command you will probably need to change
the limits for maximum file size and/or CPU time. This can be done with
ulimit -f and ulimit -t respectively. Specify any arbitrarily high value
or make it unlimited.
% count.pl --ngram 3 --token token.txt test3.cnt /home/cs/bhoi0001/CS8761/nsp/
The top 50 joint frequency rankings are shown below.
[test3.cnt]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354215
of<>the<>LORD<>1626 48646 89427 6653 14731 1630 5964
the<>son<>of<>1300 89427 2305 48646 1322 25368 1418
the<>children<>of<>1263 89427 1936 48646 1346 25368 1403
out<>of<>the<>979 4357 48646 89427 1953 1290 14731
the<>house<>of<>901 89427 2368 48646 1325 25368 978
children<>of<>Israel<>650 1936 48646 2578 1403 650 1701
the<>land<>of<>618 89427 1885 48646 1297 25368 646
saith<>the<>LORD<>615 1262 89427 6653 891 615 5964
the<>sons<>of<>504 89427 1110 48646 517 25368 598
unto<>the<>LORD<>488 8944 89427 6653 2020 489 5964
the<>LORD<>and<>484 89427 6653 59758 5964 8114 520
it<>came<>to<>464 12324 3349 30166 501 816 1080
came<>to<>pass<>462 3349 30166 906 1080 462 719
and<>all<>the<>461 59758 8710 89427 1037 3875 2661
and<>I<>will<>456 59758 23430 4922 1700 633 2044
said<>unto<>him<>454 6088 8944 9642 1643 504 1126
And<>he<>said<>452 13476 15266 6088 1637 1202 1022
the<>king<>of<>427 89427 2695 48646 1662 25368 851
And<>the<>LORD<>408 13476 89427 6653 2249 409 5964
the<>hand<>of<>406 89427 1935 48646 453 25368 449
of<>the<>house<>393 48646 89427 2368 14731 463 1325
And<>it<>came<>384 13476 12324 3349 622 533 501
of<>the<>children<>364 48646 89427 1936 14731 401 1346
said<>unto<>them<>364 6088 8944 9649 1643 383 968
the<>midst<>of<>334 89427 384 48646 383 25368 334
the<>word<>of<>331 89427 952 48646 448 25368 375
the<>name<>of<>329 89427 1105 48646 350 25368 340
it<>shall<>be<>326 12324 10449 10176 614 667 2551
in<>the<>land<>325 20407 89427 1885 6985 381 1297
according<>to<>the<>323 799 30166 89427 734 341 3830
and<>they<>shall<>320 59758 10570 10449 1337 1217 894
of<>the<>land<>319 48646 89427 1885 14731 376 1297
and<>said<>unto<>314 59758 6088 8944 1043 641 1643
of<>the<>earth<>308 48646 89427 1142 14731 314 910
LORD<>thy<>God<>296 6653 4529 4614 304 675 351
the<>LORD<>thy<>296 89427 6653 4529 5964 319 304
Thus<>saith<>the<>291 514 1262 89427 305 313 891
the<>king<>s<>288 89427 2695 3699 1662 977 300
all<>the<>people<>287 8710 89427 2470 2661 341 1266
house<>of<>the<>287 2368 48646 89427 978 436 14731
and<>in<>the<>286 59758 20407 89427 839 3875 6985
he<>said<>unto<>283 15266 6088 8944 1022 466 1643
the<>LORD<>And<>277 89427 6653 13476 5964 1943 281
one<>of<>the<>270 3676 48646 89427 651 394 14731
word<>of<>the<>268 952 48646 89427 375 310 14731
in<>the<>midst<>267 20407 89427 384 6985 268 383
before<>the<>LORD<>265 2676 89427 6653 742 265 5964
the<>Lord<>GOD<>262 89427 1169 300 731 264 290
LORD<>of<>hosts<>244 6653 48646 300 263 244 285
the<>LORD<>of<>240 89427 6653 48646 5964 25368 263
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Looking at the frequency rankings alone, we cannot say whether the
collocations found are significant.
I have implemented the log-likelihood AM and Poisson's AM for trigrams. The
descriptions of their implementations can be found in their respective source
files. In the subsequent sections, we note the results of these two AMs on
the corpus, and compare and analyze the findings.
(a) Top 50 Comparison
---------------------
The 3-word collocations are found using log-likelihood for trigrams as
follows:
% statistic.pl --ngram 3 ll3 test3.ll3 test3.cnt
The top 50 3-word collocations found using the log-likelihood AM are shown below:
[test3.ll3] // Log likelihood output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354215
the<>LORD<>of<>1 111362.6616 240 89427 6653 48646 5964 25368 263
the<>children<>of<>2 88547.8405 1263 89427 1936 48646 1346 25368 1403
the<>king<>of<>3 88004.4561 427 89427 2695 48646 1662 25368 851
the<>son<>of<>4 87925.7509 1300 89427 2305 48646 1322 25368 1418
the<>house<>of<>5 85487.1624 901 89427 2368 48646 1325 25368 978
the<>land<>of<>6 85410.9035 618 89427 1885 48646 1297 25368 646
the<>earth<>of<>7 84784.2917 3 89427 1142 48646 910 25368 3
the<>people<>of<>8 84239.8952 139 89427 2470 48646 1266 25368 175
the<>same<>of<>9 83991.7332 1 89427 723 48646 662 25368 1
the<>Lord<>of<>10 83229.8499 15 89427 1169 48646 731 25368 25
the<>sons<>of<>11 83141.9206 504 89427 1110 48646 517 25368 598
the<>midst<>of<>12 83020.0501 334 89427 384 48646 383 25368 334
the<>world<>of<>13 82577.2817 7 89427 595 48646 468 25368 16
the<>one<>of<>14 82412.0136 4 89427 3676 48646 141 25368 651
the<>city<>of<>15 82330.9315 115 89427 977 48646 575 25368 158
the<>part<>of<>16 82258.6409 16 89427 574 48646 20 25368 341
the<>sea<>of<>17 82238.1683 20 89427 723 48646 474 25368 28
the<>priest<>of<>18 82170.8559 9 89427 576 48646 414 25368 18
the<>other<>of<>19 82114.2863 5 89427 1397 48646 587 25368 20
the<>first<>of<>20 82101.3345 14 89427 1185 48646 555 25368 29
the<>word<>of<>21 82025.5217 331 89427 952 48646 448 25368 375
the<>congregation<>of<>22 81959.0340 73 89427 364 48646 328 25368 82
the<>ship<>of<>23 81938.7633 3 89427 617 48646 390 25368 9
the<>tabernacle<>of<>24 81811.1994 174 89427 327 48646 286 25368 175
the<>door<>of<>25 81774.1339 108 89427 538 48646 375 25368 116
the<>hand<>of<>26 81690.8952 406 89427 1935 48646 453 25368 449
the<>end<>of<>27 81681.8431 147 89427 545 48646 260 25368 260
the<>name<>of<>28 81680.6279 329 89427 1105 48646 350 25368 340
the<>ground<>of<>29 81672.5320 2 89427 442 48646 304 25368 5
the<>Levites<>of<>30 81664.8878 3 89427 265 48646 242 25368 3
the<>wilderness<>of<>31 81662.8149 53 89427 314 48646 277 25368 54
the<>tribe<>of<>32 81648.1790 176 89427 242 48646 179 25368 207
the<>ark<>of<>33 81629.7452 139 89427 231 48646 219 25368 147
the<>inhabitants<>of<>34 81537.1688 157 89427 228 48646 187 25368 178
the<>Son<>of<>35 81468.1143 127 89427 293 48646 162 25368 203
the<>altar<>of<>36 81467.6858 58 89427 379 48646 277 25368 67
the<>way<>of<>37 81452.9843 169 89427 1463 48646 511 25368 229
the<>priests<>of<>38 81433.5951 11 89427 420 48646 272 25368 17
the<>whole<>of<>39 81424.3040 12 89427 478 48646 287 25368 16
the<>law<>of<>40 81419.8516 97 89427 588 48646 333 25368 111
the<>Jews<>of<>41 81417.7290 3 89427 253 48646 212 25368 3
the<>sword<>of<>42 81402.8199 24 89427 471 48646 288 25368 31
the<>morning<>of<>43 81366.9030 1 89427 505 48646 274 25368 2
the<>chief<>of<>44 81365.9421 70 89427 319 48646 236 25368 90
the<>field<>of<>45 81361.1708 27 89427 332 48646 243 25368 32
the<>top<>of<>46 81353.0009 132 89427 177 48646 163 25368 133
the<>wicked<>of<>47 81351.9657 5 89427 395 48646 249 25368 5
the<>rest<>of<>48 81347.0400 157 89427 587 48646 310 25368 164
the<>island<>of<>49 81314.2605 7 89427 301 48646 219 25368 10
the<>sight<>of<>50 81294.1331 187 89427 504 48646 207 25368 226
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The 3-word collocations are found using Poisson's AM as follows:
% statistic.pl --ngram 3 user3 test3.u3 test3.cnt
The top 50 3-word collocations found using Poisson's AM are shown below:
test3.u3 //user3 measure output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1354215
of<>the<>LORD<>1 2955.2691 1626 48646 89427 6653 14731 1630 5964
the<>children<>of<>2 2915.5269 1263 89427 1936 48646 1346 25368 1403
the<>son<>of<>3 2906.3098 1300 89427 2305 48646 1322 25368 1418
children<>of<>Israel<>4 2437.1633 650 1936 48646 2578 1403 650 1701
saith<>the<>LORD<>5 1941.7497 615 1262 89427 6653 891 615 5964
came<>to<>pass<>6 1878.7467 462 3349 30166 906 1080 462 719
the<>house<>of<>7 1836.9888 901 89427 2368 48646 1325 25368 978
out<>of<>the<>8 1738.1924 979 4357 48646 89427 1953 1290 14731
said<>unto<>him<>9 1445.7324 454 6088 8944 9642 1643 504 1126
it<>came<>to<>10 1282.3006 464 12324 3349 30166 501 816 1080
And<>he<>said<>11 1241.8753 452 13476 15266 6088 1637 1202 1022
the<>land<>of<>12 1213.9892 618 89427 1885 48646 1297 25368 646
Thus<>saith<>the<>13 1182.4689 291 514 1262 89427 305 313 891
And<>it<>came<>14 1179.5944 384 13476 12324 3349 622 533 501
the<>Lord<>GOD<>15 1131.4398 262 89427 1169 300 731 264 290
said<>unto<>them<>16 1118.7897 364 6088 8944 9649 1643 383 968
LORD<>thy<>God<>17 1075.9444 296 6653 4529 4614 304 675 351
the<>sons<>of<>18 1072.1146 504 89427 1110 48646 517 25368 598
unto<>the<>LORD<>19 1006.5100 488 8944 89427 6653 2020 489 5964
LORD<>of<>hosts<>20 907.1557 244 6653 48646 300 263 244 285
land<>of<>Egypt<>21 897.9751 227 1885 48646 612 646 227 422
and<>I<>will<>22 866.0937 456 59758 23430 4922 1700 633 2044
saith<>the<>Lord<>23 849.4530 239 1262 89427 1169 891 241 731
it<>shall<>be<>24 835.0506 326 12324 10449 10176 614 667 2551
the<>midst<>of<>25 819.0449 334 89427 384 48646 383 25368 334
the<>king<>s<>26 775.3358 288 89427 2695 3699 1662 977 300
he<>said<>unto<>27 769.2952 283 15266 6088 8944 1022 466 1643
And<>thou<>shalt<>28 760.0523 212 13476 5017 1627 244 215 1024
according<>to<>the<>29 745.5014 323 799 30166 89427 734 341 3830
in<>the<>midst<>30 740.8262 267 20407 89427 384 6985 268 383
And<>the<>LORD<>31 721.3574 408 13476 89427 6653 2249 409 5964
the<>hand<>of<>32 706.9379 406 89427 1935 48646 453 25368 449
the<>king<>of<>33 683.5401 427 89427 2695 48646 1662 25368 851
in<>the<>land<>34 675.1543 325 20407 89427 1885 6985 381 1297
all<>the<>people<>35 661.7601 287 8710 89427 2470 2661 341 1266
the<>word<>of<>36 659.9338 331 89427 952 48646 448 25368 375
I<>pray<>thee<>37 657.1515 162 23430 356 3926 224 409 172
I<>could<>not<>38 656.3701 229 23430 1969 10847 782 1434 609
and<>said<>unto<>39 655.6299 314 59758 6088 8944 1043 641 1643
of<>the<>house<>40 638.2263 393 48646 89427 2368 14731 463 1325
the<>LORD<>thy<>41 637.2239 296 89427 6653 4529 5964 319 304
the<>name<>of<>42 630.4330 329 89427 1105 48646 350 25368 340
before<>the<>LORD<>43 625.5478 265 2676 89427 6653 742 265 5964
of<>the<>children<>44 613.8381 364 48646 89427 1936 14731 401 1346
I<>don<>t<>45 604.2793 120 23430 219 667 120 256 219
come<>to<>pass<>46 601.0908 164 2653 30166 906 523 164 719
and<>thou<>shalt<>47 582.1768 206 59758 5017 1627 344 225 1024
to<>pass<>when<>48 576.4219 167 30166 906 4141 719 227 167
of<>the<>earth<>49 574.9629 308 48646 89427 1142 14731 314 910
answered<>and<>said<>50 568.6228 180 602 59758 6088 190 180 1043
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can see that "the _ of" trigrams top the list of the log-likelihood
AM output. This trend continues through rank 1777 (trigram #2810). This is
due to the high frequency of 'the' and 'of'.
Since we only consider the ratio of the joint frequency to its expected
value if the 3 words were independent, this trend does not necessarily show
up in the Poisson's AM output.
We observe that Poisson's AM has found better 3-word collocations,
e.g. land of Egypt, children of Israel, LORD of hosts.
We also observe that the top 50 frequency rankings and the Poisson's AM
rankings are somewhat similar.
(b) Cutoff Point
----------------
There does exist a cutoff point in the LL3 output, as pointed out earlier.
The cutoff is at rank 1777, before which the trigrams are of the kind
"the _ of" and after which the other trigrams occur. As said earlier, this
is due to the high frequency of 'the' and 'of'. We could say that LL3
is influenced by the individual frequencies of the words in the trigrams.
As for the output of Poisson's AM, I cannot find any interesting patterns.
We do not claim that there are no patterns in the output, but at least they
are not readily discernible to the eye unaided by statistics or mining tools.
I looked for clustering of ranks and for clustering of word-type trigrams,
but could not find anything interesting. We could say that the individual
frequencies (like the high frequencies of 'the' and 'of') do not affect
Poisson's output as much as they affect the LL3 output.
(c) Rank Comparison
-------------------
The rank comparison of LL3 and Poisson's AM for trigrams was done as follows:
% rank.pl test3.ll3 test3.u3
Rank correlation coefficient = -0.1813
The negative value suggests a reverse relation; however, since the value is
close to zero, we can say that they are more or less unrelated.
As we saw in Experiment 1, part (c), log-likelihood is more similar to the
Poisson measure, but our trigram results do not confirm this. This is probably
because we only consider the joint frequency of the trigram in the Poisson
test. Still, the Poisson results are better than log-likelihood's, at least
in this case (and in the several others I tried).
(d) Overall Recommendation
--------------------------
From the results of my experiments, I would say that the 3-word collocations
found by Poisson's AM are more significant than those found by the
log-likelihood AM. However, we do not have enough data or theory to support
this statement. Our conclusions regarding the best measure for finding
collocations are further limited by the fact that we have considered only
two of the innumerable AMs.
Natural Language Processing CS8761
Bridget Thomson McInnes
10 October 2002
EXPERIMENT 1:
-------------
This experiment required the implementation of two modules to be used in conjunction with the Ngram Statistics
Package. The two modules that were implemented calculate the True Mutual Information and the Symmetric Lambda
measure. The modules were named tmi.pm and user2.pm respectively.
True Mutual Information calculates how much one random variable tells about another. The formula that was
used is:
I(X;Y) = Sum( P(x,y) * lg( P(x,y) / P(x)P(y) ) ) = H(X) + H(Y) - H(X,Y).
This states that "the information X tells about Y is the uncertainty in X plus the uncertainty about Y minus
the uncertainty in both X and Y" [1].
This formula can be rewritten to correspond to the information of a contingency table. Given a contingency table:
                 word2     !word2
         -----------------------------
 word1  |   n11    |   n12    |  n1p
         -----------------------------
!word1  |   n21    |   n22    |  n2p
         -----------------------------
            np1        np2       npp
The formula is: I(X;Y) = Sum( nij/npp * lg( nij/eij ) )
Where nij is the known frequency of a specific cell and eij is the expected value for the corresponding cell.
Calculating True Mutual Information over the contingency table gives the information word1 tells about
word2: the uncertainty in word1 plus the uncertainty about word2 minus the uncertainty in both word1 and word2.
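As a sketch of this cell-wise computation (not the actual tmi.pm code), the 2x2 table can be rebuilt from a bigram's joint frequency and its two marginal totals; the example numbers are the the<>LORD counts from the TMI output earlier in this file:

```python
import math

def tmi(n11, n1p, np1, npp):
    """True Mutual Information, Sum( nij/npp * lg(nij/eij) ), summed over
    the 2x2 table implied by joint count n11 and marginals n1p, np1."""
    cells = [(n11,                   n1p,       np1),
             (n1p - n11,             n1p,       npp - np1),
             (np1 - n11,             npp - n1p, np1),
             (npp - n1p - np1 + n11, npp - n1p, npp - np1)]
    score = 0.0
    for nij, row, col in cells:
        eij = row * col / npp        # expected value of the cell
        if nij > 0:                  # lg is undefined for empty cells
            score += (nij / npp) * math.log2(nij / eij)
    return score

# the<>LORD from the earlier TMI output: joint 5964, marginals 89427 and 6653
print(round(tmi(5964, 89427, 6653, 1354222), 4))  # 0.0152
```

Summing over all four cells (not just n11) is what distinguishes True Mutual Information from pointwise MI.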
Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting
the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row
variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage
reduction in errors in predicting the dependent given knowledge of the independent" [3].
The formula used to calculate lambda is:
lambda = ( rF + cF - Fr - Fc ) / ( (2 * N) - Fr - Fc ), where
rF = Sum of the maximum frequency in each row
cF = Sum of the maximum frequency in each column
Fr = maximum marginal row value
Fc = maximum marginal column value
N = total bigram count
Lambda is "the percent one reduces errors in guessing the value of the dependent variable when
one knows the value of the independent variable. Specifically, lambda is the surplus of errors made when the
marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent
variable are known for each value of the independent variable" [3].
"A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating
the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own
frequency distribution" [3]. Therefore a lambda of 1 indicates that knowing the distribution of the independent
variable helps estimate the value of the dependent variable.
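A minimal sketch of this formula (not the actual user2.pm code); the example is a 2x2 table for a bigram that occurs twice while each of its words occurs nowhere else, with N = 357144 as quoted in this report:

```python
def symmetric_lambda(table):
    """Goodman-Kruskal symmetric lambda:
    ( rF + cF - Fr - Fc ) / ( 2*N - Fr - Fc )."""
    cols = list(zip(*table))
    rF = sum(max(row) for row in table)   # sum of row maxima
    cF = sum(max(col) for col in cols)    # sum of column maxima
    Fr = max(sum(row) for row in table)   # largest row marginal
    Fc = max(sum(col) for col in cols)    # largest column marginal
    N = sum(map(sum, table))              # total bigram count
    return (rF + cF - Fr - Fc) / (2 * N - Fr - Fc)

# a bigram seen twice whose words occur nowhere else:
# n11 = 2, n12 = 0, n21 = 0, n22 = N - 2 = 357142
print(symmetric_lambda([[2, 0], [0, 357142]]))  # 1.0
```

Knowing either word perfectly predicts the other here, so lambda is 1; a table whose rows and columns are uninformative about each other scores 0.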
PLEASE NOTE: the experiment was run using the complete works of Shakespeare
found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare
TOP 50 COMPARISONS:
-------------------
Which measure seems better at identifying significant or interesting collocations?
The tmi module and the user2 module, which I will call the lambda module from here on out, returned extremely
different results when comparing their bigrams. There is not a single overlap between the top 50 bigrams
of each module's output; combining the top 50 bigrams of each module yields 100 unique bigrams.
The tmi module returned quite a few Proper Noun collocations, for example:
KING<>RICHARD<>
MARK<>ANTONY<>
HENRY<>VI<>
PRINCE<>HENRY<>
DOMITIUS<>ENOBARBUS<>
QUEEN<>MARGARET<>
DON<>PEDRO<>
I found this to be very good. I believe those are collocations that are desired. The lambda module did not
return any Proper Noun collocations.
The tmi module's output returned very few formal collocations, for example:
to<>be<>
my<>lord<>
These are good collocations because they represent the type of text that was used. The text that was used for
this experiment was the complete works of Shakespeare. The words "my lord" and "to be" tend
to occur often together throughout Shakespeare's writings. But there were only a few of that type of
collocation.
The tmi module returned 29 bigrams out of the top 50 containing function words or punctuation, while
the lambda module returned zero. These are bigrams with an expected high frequency. Although function word
collocations are not 'interesting' collocations, it is interesting to note that the lambda module did not
return any. Lambda indicates the percentage of error reduction in guessing the value of the dependent variable
when the independent variable is known. When a collocation has a lower frequency the lambda moves closer
to one, while when a bigram has a higher frequency the lambda moves closer to zero. Because of this, the
lambda module's output did not return any interesting bigrams in the top 50 ranks.
In the last 50 bigrams of the lambda module, quite a few interesting collocations appeared. For example:
rich<>hangings<>
blessed<>wood<>
sweet<>Exeter<>
rude<>hands<>
secretly<>open<>
desert<>country<>
widow<>sister<>
These collocations were very interesting. Although no Proper Name collocations appeared in the first 50 bigrams
of the lambda module's output, quite a few 'true' collocations appeared. More true collocations
appeared in the last 50 bigrams of the lambda output than in the first 50 bigrams of the tmi module's
output.
How would you characterize the top 50 bigrams found by each module?
I would characterize the tmi module's top 50 bigrams as consisting mostly of Proper Noun collocations and
function word collocations. It was very good at returning Proper Noun collocations which is interesting.
I would characterize the lambda module's top 50 bigrams as very poor, but I would characterize the last 50
bigrams as consisting mostly of interesting bigrams, 'true' bigrams: bigrams that would be considered
significant.
Is one of these measures significantly 'better' or 'worse' than the other? Why?
Determining whether the True Mutual Information measure is better or worse than the Lambda Symmetric measure is
difficult. The True Mutual Information measure returned many Proper Noun collocations, which are considered very
significant collocations; Proper Noun collocations are exactly the kind we want a measure to distinguish. But the
Lambda Symmetric measure returned collocations that were much more interesting, for example "desert country",
which describes something about the text being analyzed. Because of this I would have to say that the Lambda
Symmetric measure performs 'better' than the True Mutual Information measure. This opinion is based on comparing
the top 50 bigrams in the tmi module's output to the last 50 bigrams in the Lambda Symmetric module's output.
CUTOFF POINT:
-------------
Are there 'natural' cutoff points for scores, where bigrams above this value appear to be interesting or significant
while those below do not?
The lambda module's output values consist mostly of ones and zeros. At line 807 in the lambda output file
the value becomes less than one, and at line 23185 the value goes to zero. This means that 22378 out of
357114 bigrams have a value that is not equal to one or zero. The group of bigrams that have a score of
zero contains a lot of interesting bigrams, for example:
weary<>bones<>
country<>wench<>
while the group of bigrams that have a score of one, on the whole, does not, for example:
UNIMPROVED<>unreproved<>
noster<>Henricus<>
However, while collecting examples of uninteresting bigrams with a score of one I found:
helter<>skelter<>1 1.0000 2 2 2
which, of course, is very interesting. I believe this shows that significant bigrams are not confined to
bigrams with a lambda score of zero, but significant bigrams seem to occur more often when the score is zero.
As the lambda score grows closer to one the number of significant bigrams seems to decrease and, likewise, as
the lambda score moves closer to zero the number of significant bigrams increases.
The highest score for True Mutual Information is .0083, while the most common score greater than zero is
.0001. Significant collocations seem to be dispersed throughout the table. For example:
cunning<>hounds<>41 0.0000 2 155 66
chamber<>window<>40 0.0001 24 287 92
But Proper Name collocations have scores of .0005 and higher, which is a significant cutoff.
The majority of the scores fall below .0002. From line 1345 on, the scores have a value of .0001 or
lower. This means that 355799 out of 357144 bigrams score .0001 or less. Any collocation with a score of
.0001 or lower is going to be lost.
RANK COMPARISON:
----------------
Comparison between ll.pm and tmi.pm.
%csdev0540% perl rank.pl ll.txt tmi.txt
%Rank correlation coefficient = -0.9268
Since the rank correlation coefficient is close to -1, we can say that the Log-likelihood Ratio and
True Mutual Information rank the bigrams almost exactly opposite of each other.
Comparison between ll.pm and user2.pm.
%csdev0542% perl rank.pl ll.txt user2.txt
%Rank correlation coefficient = -0.6381
Again the rank correlation coefficient is closer to -1 than to 0. Therefore, we can say that
the Log-likelihood Ratio and the Symmetric Lambda measure are opposite of each other. We
should notice, though, that the Symmetric Lambda measure is closer to the Log-likelihood Ratio
than True Mutual Information is.
Comparison between mi.pm and tmi.pm.
%csdev0543% perl rank.pl mi.txt tmi.txt
%Rank correlation coefficient = -0.9273
The rank correlation coefficient is close to -1. Therefore, we can say that Pointwise Mutual
Information and True Mutual Information are opposite of each other. They are 'more opposite'
of each other than Log-likelihood and True Mutual Information are.
Comparison between mi.pm and user2.pm.
%csdev0544% perl rank.pl mi.txt user2.txt
%Rank correlation coefficient = -0.6364
The rank correlation coefficient is closer to -1 than to 0. Therefore, we can say that
Pointwise Mutual Information and the Symmetric Lambda measure are opposite of each other. We
should notice, though, that the Symmetric Lambda measure is closer to Pointwise Mutual
Information than True Mutual Information is.
OVERALL RECOMMENDATION:
-----------------------
Which is the best measure for identifying significant collocations in large corpora: mi.pm, ll.pm, user2.pm
or tmi.pm?
I think that the best measure for identifying significant collocations in large corpora is the ll.pm module.
The ll.pm module seems to rank interesting collocations higher. For example, consider a few randomly
selected collocations and the line numbers at which they appear in each measure's output:
Proper Noun Collocations
QUEEN<>GERTRUDE<>   ll.pl: 162   mi.pl: 23638    tmi.pl: 147   user2.pl: 2147
KING<>EDWARD<>      ll.pl: 147   mi.pl: 50771    tmi.pl: 174   user2.pl: 5233
MARK<>ANTONY<>      ll.pl: 16    mi.pl: 19732    tmi.pl: 18    user2.pl: 826
Significant Collocations
your<>highness<>    ll.pl: 183   mi.pl: 69966    tmi.pl: 204   user2.pl: 5943
pray<>you<>         ll.pl: 86    mi.pl: 119610   tmi.pl: 90    user2.pl: 290548
Sir<>John<>         ll.pl: 89    mi.pl: 34664    tmi.pl: 94    user2.pl: 5162
my<>heart<>         ll.pl: 114   mi.pl: 143689   tmi.pl: 111   user2.pl: 32081
We can see from the above examples that the ll.pm module gives interesting collocations a higher rank.
I chose to look at the line numbers to justify why I picked Log-likelihood: there are the same number
of bigrams in each file, so the line number gives us an idea of where each collocation falls in each
measure's ranking.
It must be noted that the user2.pl module ranks better collocations closer to the bottom than the top. Yet,
even keeping this in mind, the ll.pm module identifies the significant collocations better.
EXPERIMENT 2:
-------------
This experiment required the implementation of two modules to be used in conjunction with the Ngram Statistics
Package using trigrams. The two modules that were implemented calculate the Log-likelihood Coefficient and the
Symmetric Lambda measure. The modules were named ll3.pm and user3.pm respectively.
The Log-likelihood coefficient determines how much more likely one hypothesis is than another. The formula that
is used is:
G^2 = 2 * Sum( nij * log( nij / eij ) )
This states how much more likely the known values are than the expected values of the trigram. The formula
above corresponds to the following contingency table:
                  w3        !w3
          -----------------------------
       w2 |   n11    |   n12    |  n1p
  w1      -----------------------------
      !w2 |   n21    |   n22    |  n2p
          -----------------------------
       w2 |   n31    |   n32    |  n3p
 !w1      -----------------------------
      !w2 |   n41    |   n42    |  n4p
          -----------------------------
              np1        np2       npp
Where nij is the known frequency of a specific cell and eij is the expected value for the corresponding cell.
Calculating the Log-likelihood Coefficient over the contingency table compares the known probabilities of
the words with their expected probabilities.
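A minimal sketch of the G^2 computation (not the actual ll3.pm code, which derives the eij values from the marginal totals in the .cnt file); the toy expected values below assume N = 1000 with marginals 30 and 40:

```python
import math

def g_squared(observed, expected):
    """G^2 = 2 * Sum( nij * ln(nij / eij) ), skipping empty cells."""
    return 2.0 * sum(n * math.log(n / e)
                     for n, e in zip(observed, expected) if n > 0)

# toy 2x2 table: N = 1000, row marginal 30, column marginal 40, so the
# expected cell values under independence are 1.2, 28.8, 38.8, 931.2
obs = [10, 20, 30, 940]
exp = [1.2, 28.8, 38.8, 931.2]
print(round(g_squared(obs, exp), 2))
```

When the observed counts equal the expected counts the score is 0; the further they diverge, the larger G^2 grows.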
Symmetric Lambda (Goodman-Kruskal Lambda) is interpreted as the average of the probable improvement in predicting
the column variable Y given the knowledge of the row variable X and the probable improvement in predicting the row
variable X given the knowledge of the column variable Y. [2] This means that "its value reflects the percentage
reduction in errors in predicting the dependent given knowledge of the independent" [3].
The formula used to calculate lambda is:
lambda = ( rF + cF  Fr  Fr ) / ( (2 * N)  Fr  Fc ), where
rF = Sum of the maximum frequency in each row
cF = Sum of the maximum frequency in each column
Fr = maximum marginal row value
Fc = maximum marginal column value
N = total trigram count
Lambda is "the percent one reduces errors in guessing the value of the dependent variable when
one knows the value of the independent variable. Specifically, lambda is the surplus of errors made when the
marginals of the dependent variable are known, minus the number of errors made when the frequencies of the dependent
variable are known for each value of the independent variable" [3].
"A lambda of 0 indicates that knowing the distribution of the independent variable is of no help in estimating
the value of the dependent variable, compared to estimating the dependent variable solely on the basis of its own
frequency distribution" [3]. Therefore a lambda of 1 indicates that knowing the distribution of the independent
variable helps estimate the value of the dependent variable.
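As a concrete illustration of the lambda formula above, here is a small Python sketch (the actual module is Perl NSP code, and the counts in the 4x2 table are hypothetical):

```python
# Hypothetical 4x2 frequency table (rows: word-pair states, columns: w3 / !w3)
table = [[30, 70], [20, 380], [40, 460], [10, 8990]]

N = sum(sum(row) for row in table)

rF = sum(max(row) for row in table)            # sum of max frequency in each row
cF = sum(max(col) for col in zip(*table))      # sum of max frequency in each column
Fr = max(sum(row) for row in table)            # maximum marginal row value
Fc = max(sum(col) for col in zip(*table))      # maximum marginal column value

# 'lam' rather than 'lambda', which is a Python keyword
lam = (rF + cF - Fr - Fc) / (2 * N - Fr - Fc)
print(round(lam, 4))
```

Computed this way, lambda always falls between 0 and 1, which is why genuinely negative values suggest an implementation or extension issue rather than a property of the measure.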
PLEASE NOTE: the experiment was run using the complete works of Shakespeare
found at /home/cs/bthomson/CS8761/nsp/totalworksofshakespeare
TOP 50 COMPARISONS:

Which measure seems better at identifying significant or interesting collocations?
The ll3.pm module and the user3.pm module, which I will call the log-likelihood module and the lambda module
respectively from here on out, returned extremely different results. There is not a single overlap when comparing
the top 50 trigrams of each module's output: combining the top 50 trigrams of each module yields 100 unique trigrams.
The loglikelihood module returned trigrams that are similar to the below examples:
the<>Duke<>of<> 2 20253.8357 178 51427 502 33153 188 7915 424
the<>name<>of<> 3 19756.8674 121 51427 1383 33153 153 7915 176
the<>Earl<>of<> 4 19304.0744 46 51427 166 33153 46 7915 156
the<>house<>of<>5 19223.6022 72 51427 1159 33153 202 7915 92
49 of the top 50 trigrams are of this form. The only trigram of the 50 that did not look
similar to the examples above was: ",<>my<>lord<>1 26939.4642 1938 181185 22593 4425 4400 2068 2532".
This trigram was not very interesting either.
Lambda indicates the percentage of error reduction in guessing the value of the dependent variable
when the independent variable is known. When a collocation has a lower frequency its lambda moves
closer to one, while when a collocation has a higher frequency its lambda moves closer to zero. Because
of this, the lambda module's output did not return any interesting collocations in the top 50 ranks.
In the last 50 bigrams of the lambda module, quite a few interesting collocations appeared. I
expected this to be true for trigrams also, but it was not: the last 50 trigrams contained punctuation.
For example:
FORMAL<>regular<>,<>
Throca<>movousus<>,<>
Even with the punctuation, though, there were interesting trigrams. For example:
rigorously<>effused<>,<>
Perfect<>chrysolite<>,<>
affectedly<>Enswathed<>,<>
But as seen, each of these trigrams ends with punctuation.
How would you characterize the top 50 trigrams found by each module?
I would characterize the log-likelihood module's top 50 trigrams as consisting mostly of uninteresting trigrams.
The trigrams nearly all fit the pattern "the ... of".
I would characterize the lambda module's top 50 trigrams as very poor also.
Is one of these measures significantly 'better' or 'worse' than the other? Why?
I don't think either module is better or worse than the other; neither returns outstanding collocations.
Even taking into consideration that the lambda module ranks better collocations closer to the bottom than the top,
the collocations are not noteworthy.
CUTOFF POINT:

Are there 'natural' cutoff points for scores, where trigrams above this value appear to be interesting or significant
while those below do not?
The log-likelihood module's "the ... of" trigrams end at rank 586. They then continue
with "I ... not" trigrams, then move to "The ... of" trigrams. Not very interesting. The
trigrams that follow show a similar pattern: they come in blocks, each block looking similar, for
example the "I ... you" block and the ", ... ." block. Some trigrams look more like
small sentences than collocations, for example "YOU<>LIKE<>IT<>". Other trigrams look like
prepositional phrases, for example "more<>mirth<>than<>" and "more<>faithful<>than<>", which may be
interesting, but I don't believe they are collocations. This may have to do with the data I
am using; most sentences in Shakespeare's works are short.
The lambda module's trigrams do not have a natural cutoff either. The top 100 trigrams were not very good
collocations. For example:
,<>these<>encounterers<>
mischievous<>UNHATCHED<>undisclosed<>
The last 100 trigrams were not very good either which is not what I expected. They consisted of punctuation.
For example:
.<>CXLV<>.<>
,<>Tombless<>,
Not very useful nor interesting.
One other point about the lambda module: for trigrams, I came up with negative values. This measure
is supposed to be usable with trigram data, and I am not certain why the negative values appeared; there
does not seem to be anything wrong with the implementation. I did not get negative values until I ran
my program on a very large corpus. The measure worked perfectly on smaller data sets, where the results stayed
between -1 and 1. The trigrams that received a score of -1 were similar to the trigrams that received a score
of 1; both sets contained trigrams with punctuation. I did not read in any of the literature that the absolute
value of the score must be taken, but this is one possibility that I looked into.
OVERALL RECOMMENDATION:

Which is the best measure for identifying significant collocations in large corpora: ll3.pm or user3.pm?
In my opinion, ll3.pm is the better measure to use, though neither measure identified interesting or significant
collocations. This may have to do with the data I was experimenting with, but I did not see any grouping
of good collocations; they seemed dispersed throughout the tables. I believe that ll3.pm is the
better measure because of the negative numbers I was getting from the user3.pm module.
SOURCES:

[1] www.engineering.usu.edu/classes/ece/7680/lecture2/node3.html
[2] www.zi.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap28/sect20.htm
[3] ww2.chass.ncsu.edu/garson/pa765/assocnominal.htm
++++++++++++++++++++++++++++++++++++++
Suchitra Goopy
10/11/2002
Natural Language Processing

Assignment 2
The Great Collocation Hunt

INTRODUCTION:

"A COLLOCATION is an expression consisting of two or more words that
correspond to some conventional way of saying things"
Foundations of Statistical Natural Language Processing
(Christopher D.Manning and Hinrich Schutze)
This experiment was performed with the aim of finding a test of association
that can be implemented both for bigrams and trigrams. The NSP package was
implemented by Satanjeev Banerjee and has many interesting measures of
association for bigrams. Though NSP has been extended to accommodate n-grams,
a suitable measure of association for n-grams is yet to be implemented.
The test of association that I have used in this assignment is Goodman
and Kruskal's Gamma. This test was taken from "Handbook of Parametric and
Nonparametric Statistical Procedures" by David J. Sheskin. A test of association
is usually performed to see whether the variables involved in the test are
dependent on each other or independent. Many tests can also be used to
determine how strong or weak the association between the two variables is.
Though I learned that Goodman and Kruskal's Gamma can be extended to
accommodate trigrams, I have not found any supporting documents or references
for this. But I have extended it to perform for trigrams as well.
"Goodman and Kruskal's Gamma is a signed test that measures both the
strength and direction of the relationship, the sign indicating whether a
relationship is positive or negative."
Taken from the website at the URL below:
http://216.239.53.100/search?q=cache:VBkDxI8tR70C:www.bus.ed.ac.uk/cfm/cfmr988.pdf+extending+goodman+and+kruskal%27s+gamma++for+3+variables&hl=en&ie=UTF8
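To make the measure concrete, here is a minimal Python sketch of Goodman and Kruskal's Gamma for a 2x2 bigram contingency table (for a 2x2 table gamma reduces to Yule's Q, the normalized difference of the concordant and discordant cross products); the counts are hypothetical and the real user2.pm is Perl NSP code:

```python
# Goodman and Kruskal's gamma for a 2x2 bigram contingency table.
# Hypothetical counts:
n11, n12 = 30, 70      # w1 followed by w2 / w1 followed by something else
n21, n22 = 20, 9880    # another word followed by w2 / neither word involved

concordant = n11 * n22  # pairs agreeing in order on both variables
discordant = n12 * n21  # pairs disagreeing in order

gamma = (concordant - discordant) / (concordant + discordant)
print(round(gamma, 4))
```

Gamma lies between -1 and 1; a value near 1 indicates a strong positive relationship between the two words, which is how the rankings below should be read.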
EXPERIMENT 1:

The two main modules that had to be implemented for performing Experiment 1
were tmi.pm and user2.pm. The module tmi.pm was implemented to calculate
True Mutual Information, while user2.pm was implemented for bigrams using
Goodman and Kruskal's Gamma.
TOP 50 COMPARISONS

First a comparison was made between the outputs of tmi.pm and user2.pm.
I found that user2 had more interesting collocations than tmi. Though there
were a number of collocations in the top 50 of user2 that did not make much
sense, a few of the interesting collocations I found were:
picked<>up<>1 1.0000 1 1 385
stark<>naked<>1 1.0000 2 2 15
complied<>with<>1 1.0000 1 1 1071
split<>open<>5 0.9996 1 2 32
laboured<>hard<>15 0.9986 1 3 51
banished<>from<>32 0.9969 2 3 443
The above words usually tend to occur together and were also highly ranked
by user2. A value close to 1 indicates a very strong positive relationship.
Going down the ranks, some of the collocations I found were:
shore<>would<>6397 0.0395 1 269 482
water<>a<>6842 0.0971 2 148 2296
cannibal<>coast<>19 0.9982 1 4 43
enough<>in<>6940 0.1348 1 98 1871
These were not good collocations and were ranked low by user2. Also, some
of them have a negative sign, which indicates that the words have a negative
relationship with each other.
Tmi, on the other hand, ranked the following very high. Though "of the"
tends to occur together very often, or rather most of the time, it does not
really form a collocation; we do not have enough evidence to say it forms a
collocation (it does not make much sense by itself):
of<>the<>4 0.0086 817 3558 5854
Some of the good collocations in tmi were ranked far down the list, like the ones
given below:
cried<>out<>48 0.0007 13 18 423
cast<>away<>49 0.0006 12 32 147
rainy<>season<>50 0.0005 7 14 31
I feel that dice.pm would have performed better than the above two tests,
because it uses more information when it ranks the bigrams and hence
does a better job.
CUTOFF POINT

On looking at the output for tmi I did not find any cutoff point above
which the collocations were interesting. There were some vague and funny
collocations even in the first few ranks. Sometimes I found some interesting
collocations towards the middle, say around the 50th or 60th ranks.
The output for user2 had good collocations in the first few ranks. Though
I looked for a cutoff point, I was not able to really find one. It was true
that as the ranks decreased the interesting collocations decreased as well,
but just as I thought I had found a cutoff point I would see a few good
collocations and would have to rethink the cutoff point.
RANK COMPARISON

I used rank.pl to compare the output produced by ll.pm and tmi.pm:
rank.pl ll.out tmi.out
I got a rank correlation coefficient of -0.0338.
This slightly negative value indicates that the two rankings are essentially unrelated.
I then compared ll.pm and user2:
rank.pl ll.out user2.out 0.7117
This shows that they are ranked in almost the same way.
rank.pl mi.out tmi.out -0.1136
A comparison between mi and tmi shows a weak negative (reversed) relationship between the rankings.
mi.out user2.out 0.9229
This shows that mi and user2 are ranked in almost the same way.
Each measure is different in that it extracts, or rather captures, different
elements. The tests of association are varied and differ in what they consider
when performing their calculations. For example, if a measure uses raw
frequency values in its test then prepositions will occur more
frequently and hence will be ranked higher, but that does not mean that they
form collocations.
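The rank correlation coefficients above come from rank.pl. Assuming it computes the usual Spearman coefficient, the calculation can be sketched in Python as follows (the two rankings here are made up for illustration):

```python
# Spearman rank correlation between two rankings of the same items,
# as rank.pl is assumed to report it.
def spearman(ranks_a, ranks_b):
    n = len(ranks_a)
    # Sum of squared rank differences
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# The same five bigrams ranked by two hypothetical measures
measure1 = [1, 2, 3, 4, 5]
measure2 = [5, 4, 3, 2, 1]           # completely reversed ranking

print(spearman(measure1, measure1))  # identical rankings -> 1.0
print(spearman(measure1, measure2))  # reversed rankings  -> -1.0
```

A coefficient near 1 means two measures rank the bigrams almost identically, near -1 means they rank them in nearly opposite order, and near 0 means the rankings are unrelated.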
OVERALL RECOMMENDATION

On comparing the 4 modules I found that mi.pm had very good results. Many
collocations were interesting, like the ones below:
pro<>cessing<>1 17.0977 1 1 1
tax<>deductible<>2 16.0977 1 2 1
United<>States<>2 16.0977 2 2 2
After mi.pm, I would say that user2.pm performed very well. This is because
some good collocations were ranked very high, and as the rank decreased most
of the bigrams did not make much sense.
Then tmi.pm performed well compared to ll.pm. It seemed to give higher ranks
to better collocations. I came to this conclusion because ll.pm did not seem
to do the ranking as effectively as tmi.pm. For instance, here is an example
where the same bigram was ranked differently; I noticed this trend for a
number of them:
ll.pm : cast<>away<>335 123.3642 12 32 147 ll
tmi.pm : cast<>away<>49 0.0006 12 32 147

EXPERIMENT 2:

The two main modules that had to be implemented for Experiment 2 were
ll3.pm and user3.pm. ll3.pm was the Log-Likelihood Ratio for trigrams;
user3.pm was the same test of association used for bigrams, extended
for trigrams.
TOP 50 COMPARISONS

When I compared the output produced by ll3.pm and user3.pm, I noticed that
most of the trigrams (at least in ll3) had at least one or two punctuation marks
like , or ; . I then tried to eliminate the formation of these trigrams by
using the stop*** option provided in NSP. But this option removes a trigram
only if it is made up entirely of stop words from the STOP file. This did pose
some problems when trying to find and analyse the interesting trigrams.
Probably, if I had been able to eliminate the punctuation in the trigrams,
more meaningful collocations could have been formed.
user3.pm provided more interesting trigrams than ll3.pm. As said above,
ll3.pm had a number of trigrams with punctuation marks in the first
hundreds of ranks. The following were some of the interesting trigrams that I
noticed in the output produced by user3.pm:
services<>for<>which<>1 1.0000 1 1 1320 889 1 1 8
Thank<>God<>!<>1 1.0000 1 1 159 92 1 1 3
uneven<>state<>of<>1 1.0000 1 1 22 3558 1 1 13
circumstance<>presented<>itself<>13 0.9988 1 3 12 31 1 1 2
Please<>note<>:<>50 0.9951 1 2 6 337 1 1 1
ll3.pm, on the other hand, had a lot of punctuation marks in its trigrams,
and most of the time they were not good collocations. Though some words
tend to occur together, they do not necessarily form collocations.
CUTOFF POINT

I did not find any interesting cutoff point above which there were unusual
or good collocations. There were a number of good collocations in the beginning,
and later too you could find some collocations that were really good,
sometimes better than the higher-ranked ones. I found this to be true for both
ll3.pm and user3.pm.
RANK COMPARISON

I used rank.pl to compare ll3.pm and user3.pm. The rank correlation
coefficient I obtained was:
ll3.out user3.out -0.2018
This negative value shows that the two tests tend toward reversed rankings.
The two tests differ in that they appear to capture different elements
and rank them accordingly. Log-likelihood is a very popular measure
compared to Goodman and Kruskal's Gamma, but in many ways I felt
that the performance of user3 was much better.
OVERALL RECOMMENDATION:

Though I have not found proof to support the test of association in user3.pm,
I think it works fairly well; I found some trigrams that formed
very good collocations. Also, I feel that if there were a mechanism to keep
punctuation marks from occurring in the trigrams, a better analysis of the
experiment could have been carried out.
CONCLUSION

I did notice one thing in the experiments: a strong association between
words does not necessarily mean they form a collocation. They could have
a high ranking because they appeared a number of times in the text
and hence were put together, yet still not form a collocation.
This experiment also led me to do a lot of research in trying to
find a test of association. I also spent considerable time thinking
about the trigrams and bigrams and trying to see whether they formed meaningful
collocations.
***The stop option was provided by creating a file STOP.txt.
This file had ; , etc., each on a different line.
Then the stop option was used on the command line when executing the program.
++++++++++++++++++++++++++++++++++++++
Paul Gordon
CS8761
1913768
10/10/02
Assignment 2  The great Collocation Hunt
Introduction:
The great collocation hunt is an experiment in using statistical measures of association to rank and categorize collocations. The intention is to rank the most "interesting" collocations first, followed by more mundane, or less interesting, collocations. In this experiment, four tests of association were used with bigrams, and the results compared with each other. Two of these tests are part of the nsp package: mi.pm, which measures pointwise mutual information, and ll.pm, which measures the log-likelihood. Two of the measures were implemented by the experimenter: tmi.pm, which measures true mutual information, and user2.pm, which measures the odds ratio. Also, two tests of association were used with trigrams: ll3.pm, which measures log-likelihood, and user3.pm, which measures the odds ratio for trigrams. The corpus used for the experiments was the combination of several Project Gutenberg files of the following texts: Crime and Punishment, Of Human Bondage, Moby Dick, and the complete King James Bible.
BIGRAMS: user2.pm, tmi.pm, mi.pm and ll.pm
TOP 50 COMPARISON
The original implementation of True Mutual Information (tmi.pm) had only 49 ranks for a corpus of nearly 2 million tokens. This was a problem: apparently TMI can become quite small with large corpora. As a result, I decided to multiply my result by a constant factor. This wouldn't affect the ordering, other than making the measure finer grained. The alternative, modifying the number of significant figures reported by statistic.pl, was not an option.
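The true mutual information computation just described, including the constant scaling factor, can be sketched in Python (the real tmi.pm is Perl). The counts here are loosely modeled on the "don t" row shown below, with an assumed corpus size of 2 million tokens:

```python
import math

# 2x2 contingency table for a bigram (w1, w2); counts are illustrative
n11 = 972       # w1 followed by w2
n12 = 1         # w1 followed by another word
n21 = 1985      # another word followed by w2
n22 = 1997042   # neither word involved
N = n11 + n12 + n21 + n22

SCALE = 1_000_000_000  # constant factor to keep tiny TMI values distinguishable

# Sum p(x,y) * log( p(x,y) / (p(x) * p(y)) ) over all four cells,
# with marginals taken from the corresponding row and column totals
tmi = 0.0
for nij, (row, col) in [(n11, (n11 + n12, n11 + n21)),
                        (n12, (n11 + n12, n12 + n22)),
                        (n21, (n21 + n22, n11 + n21)),
                        (n22, (n21 + n22, n12 + n22))]:
    p_xy = nij / N
    p_x, p_y = row / N, col / N
    if p_xy > 0:
        tmi += p_xy * math.log(p_xy / (p_x * p_y))

print(round(tmi * SCALE, 4))
```

Because the sum runs over every cell of the table, TMI is always non-negative, and with a 2-million-token corpus the unscaled values are small enough that the scaling factor is what keeps statistic.pl's reported figures distinct.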
User2.pm, which measures the odds ratio, did well at finding proper nouns. Not until number 21, with "categorical imperative", is the collocation something other than a proper noun (although "PUBLIC DOMAIN" is not really a proper noun). The top ten results follow:
Pulcheria<>Alexandrovna<>1 894532483.0000 123 123 123
Avdotya<>Romanovna<>2 836590755.0000 115 115 115
Moby<>Dick<>3 633790675.0000 87 87 87
Svidriga<>lov<>4 214698175.0000 207 210 207
Marfa<>Petrovna<>5 189534429.6667 78 78 79
Nikodim<>Fomitch<>6 177467563.0000 24 24 24
PROJECT<>GUTENBERG<>7 119519499.0000 16 16 16
El<>Greco<>8 97788843.0000 13 13 13
Praskovya<>Pavlovna<>9 68814523.0000 9 9 9
Sag<>Harbor<>10 61570923.0000 8 8 8
This particular measure scores high when the bigram parts are associated only with that bigram; notice that for the top three, the first and last words never occur as first and last words in any other bigrams. Just beyond "categorical imperative" there are several interesting collocations, including "hoky poky", "Pell mell" and "web browser".
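A minimal Python sketch of the odds ratio for a bigram's 2x2 table follows (user2.pm itself is Perl). The counts mimic the "Pulcheria Alexandrovna" row above, where the constituent words never occur apart; the small smoothing constant is my assumption, not necessarily what user2.pm does, and it simply keeps the ratio finite when the off-diagonal cells are zero:

```python
# Odds ratio for a bigram's 2x2 contingency table; counts are illustrative.
n11, n12, n21, n22 = 123, 0, 0, 1_999_877  # pair always occurs together

def odds_ratio(n11, n12, n21, n22, smooth=0.5):
    """Cross-product (n11*n22)/(n12*n21), with smoothing so that
    zero off-diagonal cells do not cause division by zero."""
    return ((n11 + smooth) * (n22 + smooth)) / ((n12 + smooth) * (n21 + smooth))

print(odds_ratio(n11, n12, n21, n22))
```

When the two words are independent the ratio is near 1, and it explodes when the words only ever appear as this bigram, which is exactly why this measure ranks exclusive proper-noun pairs at the top.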
In contrast, there are virtually no proper nouns in the top 50 of the true mutual information measure, Lord and Israel being the exceptions. The top ten results from the mutual information measure are shown below. It appears that in order to rank high, not only does the bigram have to appear often, but each individual word has to appear often as well. This makes sense considering that tmi is summed over all squares of the contingency table. Notice that the tmi is much larger than it should be. This is because of the multiplying factor mentioned above. To get the actual value, divide by 1000000000.
,<>and<>1 43006180.2526 32702 119732 59360
of<>the<>2 13431457.2323 15222 50674 94305
the<>LORD<>3 12450162.9418 5962 94305 6648
.<>He<>4 9075845.1419 4181 65625 5246
in<>the<>5 7285833.3391 7752 23359 94305
shall<>be<>6 6198847.1158 2526 10154 10124
;<>and<>7 5242192.1121 4755 17562 59360
don<>t<>8 5106985.6083 972 973 2957
I<>will<>9 4941218.7702 2055 18788 4891
I<>am<>10 4360132.2218 1323 18788 1537
CUTOFF POINT
There does not appear to be any good cutoff point for the true mutual information measure. But then, that is because so many of the top ranking collocations don't appear to be very interesting, just common. There are many connecting words in the collection. And there isn't much to distinguish the underlying corpus.
There appears to be a cutoff area, rather than a cutoff point, for the odds ratio. All bigrams whose constituents are only ever a component of that bigram, and which appear only once ( 1, 1, 1 ), are ranked 23. Between there and rank 40, the bigrams go from interesting to unusual. At rank 40, one of the bigrams is "beneficent publicity"; this strikes me more as an unusual phrase than a collocation, and there are many of these by rank 40. This measure picks up bigrams whose components appear together a lot relative to how often they appear within other bigrams. So at first the measure picks out interesting pairs, but further down the ranking they lose their interesting qualities.
RANK COMPARISON
rank ll to user2 : 0.6858
rank ll to tmi : 0.9984
rank mi to user2 : 0.9949
rank mi to tmi : 0.7049
True mutual information is more like the log-likelihood measure, and user2 is more like pointwise mutual information. Log-likelihood and true mutual information probably have a high correlation because they use a similar method of summing logs over each square of the contingency table. It is less clear to me why there is such a high correlation between pointwise mutual information and the odds ratio: I cannot see a relationship between the two fractions used in these measures, and what is more, the pointwise mutual information calculation takes the log of the fraction. In general, though, neither measure sums over each square of the contingency table, and so they should be more like each other than they are like the other two measures.
OVERALL RECOMMENDATION
Based on the above experiments, it appears that the odds ratio does a much better job of giving interesting collocations a high rank. However, it also seems clear that the odds ratio misses interesting collocations with common words. As the ranking results suggest, the loglikelihood measure is very similar to that of true mutual information. In fact, many of the same bigrams appear in the top 50 of each. As such, though, it suffers a similar problem, and that is that the "interesting" collocations are buried under the common ones. Pointwise mutual information is not as closely similar to the odds ratio, as tmi is to the loglikelihood. It is, rather, similar in style. It also seems to rank highly, bigrams whose components don't make up other bigrams, but it is not as extreme as the odds ratio. After examining the ranking each of these measures gave the bigrams in the corpus, I consider the pointwise mutual information to have the best list of interesting bigrams.
TRIGRAMS: user3.pm and ll3.pm
TOP 50 COMPARISON
As with the bigram form of the odds ratio, the trigram measure picks up a lot of names. Now, however, it attaches whatever the preceding word is, so nearly the whole top fifty consists of one of two names preceded by another word. Obviously this is not very useful. The trigram log-likelihood is also much as it was in the bigram case: it ranks highly very common token combinations like "and the ,". The trigrams do give more of a sense of the underlying text, especially from the Bible's contribution. It turns out that the Bible was an unfortunate choice for inclusion because of the chapter-verse markers, which were noticeable with the bigrams but are now very pronounced.
CUTOFF POINT
Again there did not appear to be any clear cutoff for the log-likelihood measure. Indeed, it seemed as though the quality of the trigrams improved with depth, but even at ranks above 4,000 they don't appear to me to be very interesting. The odds ratio ranking has a somewhat strange cutoff: after about rank 500 the names start to drop off and interesting collocations begin to appear. I suppose this is a result of this measure's affinity for ranking proper nouns highly.
OVERALL RECOMMENDATION
Based on the results of both trigram statistics, my overall recommendation is the odds ratio measure. It has, in my estimation, many more interesting trigrams. It would be especially good if there were a way to filter out the repetition of proper nouns. To me, the log-likelihood ranks seem just as uninteresting at the beginning as further down. Actually, somewhere in the middle there appear to be what I would call interesting trigrams. For example, the following is taken from around rank 400,000:
is<>life<>;<>404855 189.2558 1 10385 1067 17562 11 211 54
these<>cases<>,<>404856 189.2550 3 1602 64 119732 4 229 23
were<>consumed<>with<>404857 189.2540 1 5305 108 12113 7 118 11
the<>letter<>suddenly<>404858 189.2532 1 94305 228 446 70 3 1
eyes<>were<>somehow<>404859 189.2528 1 1203 5305 83 49 1 4
while<>I<>live<>404860 189.2527 1 676 18788 441 37 2 43
near<>here<>,<>404861 189.2510 2 350 883 119732 3 30 177
be<>set<>for<>404862 189.2469 2 10124 912 12320 23 193 8
justice<>take<>hold<>404863 189.245
"while I live" and "justice take hold" seem to me to be good candidates for "interesting" trigrams.
CONCLUDING THOUGHTS
Conducting the experiments for this assignment has gotten me to think more deeply about what makes a bigram a collocation. In one sense, a collocation is a bigram where the likelihood of the two words occurring together by chance is less than what is actually observed; that is, there is some statistical significance. But part of this assignment was to assess how well the measures did, which is a subjective criterion. In the Manning and Schutze chapter on collocations, they were interested in fairly common words such as "strong tea" and "powerful computers." Yet the results I got from the statistical tests tended to rank high either very general word combinations like "is a" or fairly uncommon pairs like "El Greco."
++++++++++++++++++++++++++++++++++++++
Name: Prashant Jain
CS8761 Assignment #2:The Great Collocation Hunt
Introduction

This assignment involved finding a suitable measure of association that can identify collocations, defined in the textbook as "an expression consisting of two or more words that correspond to some conventional way of saying things," in a large corpus of text. I had to implement a measure that can be used with bigrams and trigrams and then compare it with other measures that have been implemented in the Ngram Statistics Package.
There are many different ways of measuring association amongst collocations (of different lengths, say 2 or 3). Several methods have been implemented in the Ngram Statistics Package already: Pearson's Chi-Squared Test, the Log-Likelihood Ratio, the Dice Coefficient, Pointwise Mutual Information and the Left Fisher's test. I had to implement the 'True' Mutual Information test for collocations of length 2, and find and implement another suitable test to identify collocations (which had to be extended and implemented for collocations of length 3). I also had to implement the log-likelihood ratio for collocations of length 3. I then had to compare the results from the various measures I implemented.
Measure Used and Implemented

The measure that I found and used in my experiments is a variation of the cross-product ratio called Yule's Coefficient of Colligation. From the cross-product ratio we get the measure of association alpha, which is the ratio of the cross products of the 2x2 table and is calculated thus:
alpha = (xy * xbarybar) / (xybar * xbary)
When there is no association present, alpha = 1; otherwise alpha is either greater than 1 or less than 1, depending on the degree of association. We can also generalize the cross-product ratio to more than 2 dimensions (useful to us for extending the test to trigrams). For three dimensions, the cross-product ratio is calculated like this:
alpha = (xyz * xybarzbar * xbaryzbar * xbarybarz) / (xyzbar * xybarz * xbaryz * xbarybarzbar)
I didn't, however, use this value of alpha directly, because of its unbounded nature. Instead I used a transformation of alpha that restricts its range to a finite interval: the Coefficient of Colligation bounds the value between -1 and 1. It is calculated thus:
Y = (sqrt(alpha) - 1) / (sqrt(alpha) + 1)
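The two formulas above can be checked with a short Python sketch (the actual user2.pm is Perl). The counts are taken from the 'Export Import' row reported later in the USER2 table: 14 joint occurrences, each word occurring 15 times in total, out of 1,336,151 bigrams overall:

```python
import math

# 2x2 counts for the 'Export Import' bigram
n_xy   = 14                       # x and y together
n_xyb  = 15 - 14                  # x without y
n_xby  = 15 - 14                  # y without x
n_xbyb = 1336151 - 14 - 1 - 1     # neither word

alpha = (n_xy * n_xbyb) / (n_xyb * n_xby)            # cross-product ratio
Y = (math.sqrt(alpha) - 1) / (math.sqrt(alpha) + 1)  # coefficient of colligation

print(round(Y, 4))  # -> 0.9995, matching the value reported for 'Export Import'
```

The square-root transformation is what tames alpha's unbounded range: however large the cross-product ratio grows, Y only approaches 1 from below.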
Text Used

I used a corpus of over 1 million words taken from the Brown Corpus that had been made available to us at /home/cs/tpederse/CS8761/BROWN1. I created a file called CORPUS (available at /home/cs/jain0069/CS8761/nsp), which I created by concatenating the data present in all the files in the BROWN1 directory.
Experiment 1

I had to implement 'True' Mutual Information for collocations of length 2. I used the following formula:
I(X;Y) = Sum over all values of x and y of f(x,y) * log( f(x,y) / (f(x) * f(y)) )
where f denotes frequency. I created and implemented this in a file called tmi.pm. I then ran the count.pl program of the Ngram Statistics Package to extract the bigrams from the corpus. After that, I ran the statistic.pl program on the output file produced by count.pl, using the tmi.pm file. The results were stored in an output file called outtmi.txt.
At the Command Line:
./count.pl testoutput.txt CORPUS
./statistic.pl tmi.pm outtmi.txt testoutput.txt
I then implemented the 2-length collocation version of the Coefficient of Colligation in a file called user2.pm. I repeated the procedure followed earlier and got the final result in an output file called outuser2.txt.
At the Command Line:
./statistic.pl user2.pm outuser2.txt testoutput.txt
Comparison of the Top 50 Ranks between tmi.pm and user2.pm

A total of 1,336,151 bigrams were extracted from the CORPUS. I give a snapshot of the top ten values extracted by both tmi.pm and user2.pm in the tables below.
TMI Table

Collocation Extracted Rank TMI Value XY Value X Value Y Value
. The 1 0.0135 4961 49673 6921
of the 2 0.0102 9181 36043 62690
. He 3 0.0071 2421 49673 2991
in the 4 0.0062 5330 19581 62690
. It 5 0.0049 1718 49673 2184
, and 5 0.0049 5530 58974 27952
. In 6 0.0037 1320 49673 1736
, but 7 0.0034 1648 58974 3006
. But 8 0.0033 1115 49673 1371
the the 8 0.0033 3 62690 62690
USER2 Table

Collocation Extracted Rank Coeff. of Coll. Value XY Value X Value Y Value
Export Import 1 0.9995 14 15 15
Yugoslav Claims 2 0.9993 7 8 8
PROVIDENCE PLANTATIONS 2 0.9993 6 7 7
Push Pull 3 0.9992 9 11 10
Dolce Vita 4 0.9991 4 5 5
Tar Heel 4 0.9991 8 10 9
Taft Hartley 5 0.9990 3 4 4
Baton Rouge 5 0.9990 3 4 4
Lo Shu 5 0.9990 17 20 19
Grands Crus 6 0.9988 2 3 3
I notice that amongst the top 50 ranked collocations found by tmi.pm and user2.pm, the ones found by user2.pm seem much more interesting than those found by tmi.pm. As can be seen in the cross-section of the top ten values given in the USER2 table, the collocations ranked near the top seem to occur together more frequently whenever they appear in the corpus. For example, the top-ranked collocation for USER2 is 'Export Import': the words Export and Import occur together 14 times in the corpus, and each appears only once on its own. As we look down the table, indeed the file, we see a similar pattern: the ranked collocations occur together in a higher proportion to their total occurrence in the entire corpus. The collocations (if they can be called collocations!?) extracted by the TMI test appear to be mostly non-interesting. I think the reason is that TMI favours high-frequency words of any kind, which means that even uninteresting words like 'the', 'in', 'but' etc. are also incorporated in its pool of collocations.
CutOff Point

I have not been able to notice any particular cutoff point in the output of tmi.pm. However, in the output of user2.pm I noticed something to the effect of a cutoff point. If we look at the collocations up to around rank 60, most of them are interesting, but as we go down the file the number of interesting collocations keeps falling, and it drops quite significantly after around that rank. Hence we can say that rank 60 is the cutoff point as far as user2.pm is concerned.
Note: more than one collocation can share the same rank, so 'up to rank 60' does not mean just 60 collocations.
Rank Comparison

I first compared the output of loglikelihood ratio for two word collocations with the result of 'true' mutual information. On running rank.pl:
Rank Correlation Coefficient = 0.9667
This coefficient is very near 1, which indicates that the two tests rank the collocations very similarly. I then compared the output of Pointwise Mutual Information for two-word collocations with the results of 'true' mutual information. On running rank.pl:
Rank Correlation Coefficient = 0.9667
Again the Rank Correlation Coefficient is very near 1, which indicates that these two tests also produce very similar rankings.
Now running rank.pl to compare user2.pm with the log likelihood ratio, we get:
Rank Correlation Coefficient = 0.6180
This noticeably lower coefficient tells us that my test, implemented in user2.pm, ranks the collocations rather differently from the log likelihood ratio. Now we compare user2.pm with Pointwise Mutual Information. On running rank.pl, we get:
Rank Correlation Coefficient = 0.7605
Here the Rank Correlation Coefficient is somewhat closer to 1, so the outputs of Pointwise Mutual Information and user2 are more alike than those of log likelihood and user2. Hence we can conclude that while TMI ranks much like the two standard measures, the Coefficient of Colligation gives output only moderately similar to the log likelihood test and rather more similar to PMI.
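The comparison done by rank.pl is based on a rank correlation coefficient; the idea can be sketched with Spearman's formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between the ranks the two tests assign to the same collocation. This is a toy illustration with made-up, tie-free rankings, not a re-run of rank.pl; a coefficient near 1 means the two measures order the collocations similarly.

```python
def spearman(ranks_a, ranks_b):
    """Spearman's rank correlation between two rankings of the same items
    (no ties assumed)."""
    assert len(ranks_a) == len(ranks_b)
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two hypothetical measures ranking the same five collocations:
print(spearman([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # identical rankings -> 1.0
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # similar rankings -> 0.8
```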
Overall Recommendation

Choosing between TMI and the Coefficient of Colligation (CoC), we can easily conclude that it would be better to use CoC, since we have seen better results from CoC almost everywhere, both in the rank comparison and in the collocations extracted by the two tests. Choosing between CoC, Log Likelihood and MI would, however, be a more difficult job. We saw the first ten values in the TMI and CoC output a little earlier. Now let's look at the first ten values in the output of Log Likelihood and MI; maybe that will help us decide which is the better measure for finding collocations. The snapshot table for Log Likelihood is as follows:
Log Likelihood

Collocation Extracted   Rank   Log L. Value   n(XY)   n(X)    n(Y)
. The                   1      25043.4196     4961    49673   6921
of the                  2      18856.1038     9181    36043   62690
. He                    3      13183.4486     2421    49673   2991
in the                  4      11393.6840     5330    19581   62690
. It                    5      9139.3156      1718    49673   2184
, and                   6      9072.9347      5530    58974   27952
. In                    7      6844.0270      1320    49673   1736
, but                   8      6309.5959      1648    58974   3006
the the                 9      6126.7892      3       62690   62690
. But                   10     6064.5057      1115    49673   1371
PointWise Mutual Information

Collocation Extracted    Rank   MI Value   n(XY)   n(X)   n(Y)
Burly leathered          1      20.3497    1       1      1
Capetown coloreds        1      20.3497    1       1      1
Bietnar Haaek            1      20.3497    1       1      1
aber keine               1      20.3497    1       1      1
Despina Messinesi        1      20.3497    1       1      1
Thoroughbred Racing      1      20.3497    1       1      1
Josephus Danielswas      1      20.3497    1       1      1
POLAND _FRONTIERS        1      20.3497    1       1      1
cha chas                 1      20.3497    1       1      1
Recherche Scientifique   1      20.3497    1       1      1
As we can see in the tables, log likelihood gives disappointing values for its top ten ranked collocations. As with True Mutual Information, we don't see anything that might be classified as interesting, so we can't claim it performs better than user2. However, when we look at the data for Pointwise Mutual Information, we see somewhat better results. Looking further down the table, we find collocations like
TRAVEL CLUB 6 18.0277 1 1 5
This is an interesting collocation identified by the test. But one problem I noticed is that large clusters of collocations share the same MI value and hence the same rank. If we rely on rank alone to pick out the interesting collocations, we will have a difficult time, because there are simply too many collocations with the same rank, and singling out a few interesting ones is nearly impossible.
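The tie problem is easy to see from the pointwise MI formula, MI = log2( n(xy)*N / (n(x)*n(y)) ): every hapax bigram with n(xy) = n(x) = n(y) = 1 scores exactly log2(N), which with N = 1,336,151 bigrams is the tabled 20.3497. A quick sketch (assuming base-2 logs, as the tabled values suggest):

```python
from math import log2

def pmi(nxy, nx, ny, n):
    """Pointwise mutual information of a bigram, in bits."""
    return log2((nxy * n) / (nx * ny))

N = 1336151
# Any bigram whose words each occur exactly once scores log2(N),
# so all hapax bigrams tie at the same top rank:
print(round(pmi(1, 1, 1, N), 4))
print(pmi(2, 2, 2, N) < pmi(1, 1, 1, N))  # more evidence, lower PMI
```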
Hence, according to these test results, using user2 to find two-word collocations would be the best choice in this case.
Experiment 2

I had to implement the log likelihood ratio for collocations of length 3 in this experiment. The formula I use is:
G(X,Y,Z) = sum over all (x,y,z) of [ f(x,y,z) * log( f(x,y,z) / E(x,y,z) ) ]
where f(x,y,z) is the observed trigram frequency. The expected value is calculated thus:
E(X,Y,Z) = ( f(X) * f(Y) * f(Z) ) / (Total Trigrams * Total Trigrams)
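A single trigram's contribution under these formulas can be sketched as below. This is a simplified illustration of the independence model only, not the full ll3.pm (which sums over the whole contingency table); the point is that when the observed frequency equals the expected one, the contribution is zero.

```python
from math import log

def ll3_term(fxyz, fx, fy, fz, n):
    """One trigram's term of the log likelihood score under the
    independence model P(x,y,z) = P(x)P(y)P(z)."""
    expected = (fx * fy * fz) / (n * n)   # E(X,Y,Z) from the formula above
    return fxyz * log(fxyz / expected)

# A trigram occurring exactly as often as independence predicts
# contributes nothing (expected = 100**3 / 100**2 = 100):
print(ll3_term(100, 100, 100, 100, 100))  # 0.0
```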
I created and implemented this in a file called ll3.pm. I then ran the count.pl program of the Ngram Statistics Package to extract the trigrams from the corpus. After that I ran the statistic.pl program, using ll3.pm, on the output file produced by count.pl. The results were stored in an output file called outll3.txt.
At the Command Line:
./count.pl --ngram 3 outtri.txt CORPUS
./statistic.pl --ngram 3 ll3.pm outll3.txt outtri.txt
I then implemented the three-word version of the Coefficient of Colligation in a file called user3.pm. I repeated the procedure followed earlier and got the final result in an output file called outuser3.txt.
At the Command Line:
./statistic.pl --ngram 3 user3.pm outuser3.txt outtri.txt
Comparisons of Top 50 Ranks between ll3.pm and user3.pm

The total number of trigrams extracted from the corpus was 1,336,150. Again, I give snapshots of the top 10 values from the output of both ll3.pm and user3.pm. These are:
Log Likelihood Ratio (for Collocations of length 3)

Collocation Extracted       Rank   LL Value     n(XYZ)   n(X)    n(Y)   n(Z)   n(XY)   n(XZ)   n(YZ)
N01 0680 chair              1      89726.2127   1        166     500    66     1       1       1
A20 0710 Britain            2      89663.1952   1        187     497    61     1       2       1
a belligerent motion        3      89653.3366   1        21998   5      59     1       3       1
J03 0151 be                 4      87199.4986   1        176     2      6340   1       4       1
worn with crimson           5      87197.4879   1        26      7007   8      1       1       1
new culture ;               6      87184.0217   1        1045    58     2753   1       6       1
, Selena said               7      86914.4145   1        58974   4      1950   1       614     1
conditions prevail ;        8      85214.6164   1        178     7      2753   1       1       1
enormous installations at   9      85214.4259   1        37      16     4966   1       1       1
source F17 1680             10     85211.4453   1        90      177    458    1       1       1
USER 3 TABLE

Collocation Extracted   Rank   CoC Value   n(XYZ)   n(X)    n(Y)    n(Z)    n(XY)   n(XZ)   n(YZ)
9 a the                 1      0.9937      1        95      21998   62690   2       2       2
parts a and             2      0.9847      2        109     21998   27952   3       3       8
thyroid . .             3      0.9845      1        38      49673   49674   3       3       2
of will she             4      0.9842      1        36043   2201    2087    2       9       2
as , One                5      0.9834      1        6691    58974   417     10      2       2
be you also             6      0.9833      1        6340    2979    997     2       2       2
to and including        7      0.9828      2        25725   27952   170     16      3       3
for or against          8      0.9813      2        8833    4127    618     3       3       5
on or before            9      0.9810      7        6405    4127    947     11      10      8
may or may              10     0.9802      10       1286    4127    1286    11      11      15
On examination, the two sets of data do not reveal much dissimilarity in extracting collocations of length 3, at least at face value. We don't see many interesting collocations in the top ten values shown in the tables. However, there are some values of interest scattered over the top few ranks. For example, in outuser3.txt we have the following at rank 14:
As of now 14 0.9775 1 540 36043 1037 3 2 2
This seems to be an interesting collocation, present fairly high up in the file. But as we go down, the quality of the extracted collocations keeps deteriorating, until no interesting collocations are being extracted by the user3.pm test at all. An interesting thing I noticed in the log likelihood ratio results was the unusually high frequency of collocations either beginning with a period (.) or having a period somewhere in between. For example:
. The oilheating 64 78973.6002 1 49673 6921 2 4961 1 1
. The raped 65 78567.5017 1 49673 6921 2 4961 1 1
apple . The 66 78516.2169 1 8 49673 6921 1 1 4961
. The lordly 67 78508.7386 1 49673 6921 2 4961 1 1
As we can see, four consecutive ranks contain the period. In fact, the next 10 ranks were the same.
Cut Off Point

As I said, the collocations extracted by user3.pm are not very interesting. However, do they show any trends or cutoff points? It does seem so. Again, if we go beyond ranks 55-60, we see a decided decrease in how interesting the collocations are. Hence we can say that the cutoff point for user3 is around rank 55. We notice a certain trend in the log likelihood data as well: beyond rank 35, the collocations are mostly uninteresting. A few interesting collocations do exist, but they are lost in a maze of too many uninteresting ones.
Overall Recommendation

Here we have to choose only between two tests, log likelihood and the Coefficient of Colligation. Though the CoC test performs much better on two-word collocations, when it comes to three-word collocations each of them produces equally uninteresting results. Log likelihood in particular, with its excessive periods, leaves little room for finding interesting collocations. So I would say that neither test can really be considered better than the other; each performs below par for the given data.
Conclusion

To conclude, the measure that I found and used to identify interesting collocations in large corpora of text gave mixed results: very good results at finding two-word collocations, but below-average results when searching for interesting three-word collocations. We did, however, learn that even the log likelihood ratio for three-word collocations does not give very good results. So we can conclude that we still need to find a decent method for extracting three-word collocations from large corpora of text, because the ones analyzed here do not seem to do the job up to par.
References

Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schutze. MIT Press.
Multiway Contingency Table Analysis for Social Sciences by Thomas D. Wickens
http://www2.chass.ncsu.edu/garson/pa765/assoc2x2.htm
http://www.spss.com/tech/stat/algorithms/11.0/cluster.pdf
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
Name : Rashmi Vinay Kankaria
Assignment no 2
Objectives : To find a measure of association and implement it for both bigrams and trigrams; also to find TMI for the given bigrams and the log likelihood ratio for trigrams.
Experiment 1 :
First Module : tmi.pm
This module is implemented to find the true mutual information of the given bigrams.
True mutual information can be formulated as:
I(X;Y) = sum (over x, y) [ P(x,y) * log( P(x,y) / (P(x)*P(y)) ) ]
1629547
.<>The<>1 19721.9494 5593 65255 7800
of<>the<>2 16166.1332 10710 42133 74521
,<>and<>3 12077.0140 8584 77904 34436
.<>He<>4 11447.6870 2986 65255 3671
in<>the<>5 10115.0110 6403 23448 74521
.<>It<>6 9809.9502 2555 65255 3130
,<>but<>7 6486.1348 2362 77904 4111
.<>I<>8 5816.9501 3392 65255 12556
.<>But<>9 5762.9153 1506 65255 1846
to<>be<>10 5444.1418 2067 32004 7847
.<>In<>11 5293.2955 1458 65255 1918
,<>,<>12 5186.6391 61 77904 77904
the<>the<>13 5118.9997 3 74521 74521
had<>been<>14 4815.7797 956 6773 3290
on<>the<>15 4557.6574 2441 7241 74521
have<>been<>16 4459.8949 875 5882 3290
This is the output file of tmi.pm. The bigram ". The" has occurred the maximum number of times and so has rank one; its mutual information is given as 19721.9494.
Of the 1629547 bigrams, 5593 are ". The". This bigram has not occurred the maximum number of times (as one can see, the frequency of "of the" is 10710), yet it
still has rank one because its mutual information is higher. This means that "." and "The" tend to occur together more strongly than other words do,
individually or otherwise.
Rank correlation coefficient, ll and tmi : 0.9932
Second Module : ll3.pm
This module is implemented to perform the log likelihood ratio test for trigrams under the Null Hypothesis P(x,y,z) = P(x)*P(y)*P(z).
The ratio is given by the following formula:
G2 = 2 * summation (over i,j) [ o_ij * log( o_ij / e_ij ) ]
where
o_ij = observed value (in the ith row and jth column)
e_ij = expected value (in the ith row and jth column)
The log likelihood for trigrams can be intuitively visualised as a cuboid with three coordinate axes x, y, z. Thus the expected frequency of any sample depends
not just on its occurrences in the x and y planes but also in the z plane. That is, for "in this case", the expected frequency of "in" is calculated not just with respect to "this" but also with respect to "case" and "this case".
Third Module and Fourth Module : user2.pm and user3.pm
These modules implement a measure of association called the Log Odds Ratio, for bigrams and trigrams respectively.
For a 2 * 2 matrix,
        y     y'
 x      a     b
 x'     c     d
the value a/b gives the odds of y versus y' for x,
while the value c/d gives the odds of y versus y' for x'.
Now the ratio of these two odds tells us how much the odds of y increase for x relative to x' (y' thus, intuitively, acting as a reference point).
In simple words, suppose that A can be 70% yes and 30% no; then the chance of A being yes is 70/30 times that of no.
Also suppose B can be 60% yes and 40% no; then the odds here are 60/40, which means B is yes that many times, given the chance of no being 40%.
Now the odds ratio for A and B is (70/30) / (60/40) = f, which means that A will be "yes" f times as often as B, given their individual chances of no.
Considering the chances of no is important because "yes" and "no" together make up the sample space for each of them.
This reasoning can be extended to three variables as well, as a ratio of odds taken under the same circumstances, whether there are two variables or three.
Thus we can conclude that it is a relative measure and can easily be accommodated as a measure of association.
Generally, working with logarithmic values is more manageable, so we use the log odds ratio, whose value can also go negative. A higher value indicates that
the words are related, while as the value decreases they are less dependent (their occurrence together is less likely).
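The worked 70/30 versus 60/40 example can be checked numerically. This is a minimal sketch with illustrative helper names (odds_ratio, log_odds_ratio), not the actual user2.pm code:

```python
from math import log

def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table: (a/b) / (c/d)."""
    return (a / b) / (c / d)

def log_odds_ratio(a, b, c, d):
    """Log odds ratio; positive when x favours y more than x' does."""
    return log(odds_ratio(a, b, c, d))

# The 70% yes / 30% no versus 60% yes / 40% no example from the text:
print(round(odds_ratio(70, 30, 60, 40), 4))  # (70/30)/(60/40) = 1.5556
print(log_odds_ratio(70, 30, 60, 40) > 0)    # A's odds of 'yes' exceed B's
```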
Here is a sample of the output of user2.pm:
1629547
,<>000<>1 29.3663 446 77904 446
Sherlock<>Holmes<>2 28.2934 202 202 963
couldn<>t<>3 28.0851 175 175 2448
wasn<>t<>4 27.8913 153 153 2448
wouldn<>t<>5 27.7528 139 139 2448
Hor<>.<>6 27.3715 111 111 65268
hadn<>t<>7 27.3343 104 104 2448
isn<>t<>8 27.2037 95 95 2448
Laer<>.<>9 26.5313 62 62 65268
Project<>Gutenberg<>10 26.4689 57 86 57
Oph<>.<>11 26.4350 58 58 65268
Clown<>.<>12 26.1006 46 46 65268
ain<>t<>13 26.0262 42 42 2448
Ay<>,<>14 25.7353 36 36 77904
O<>Banion<>15 25.7233 34 261 34
Guil<>.<>16 25.7063 35 35 65268
Bang<>Jensen<>17 25.5902 31 31 35
aren<>t<>18 25.5881 31 31 2448
Osr<>.<>19 25.2209 25 25 65268
shouldn<>t<>20 25.2188 24 24 2448
of<>he<>90077 -7.9584 1 42133 9317
.<>by<>90078 -7.9589 1 65255 5943
a<>of<>90079 -7.9596 3 27651 42133
.<>you<>90080 -7.9898 1 65255 6071
to<>was<>90081 -7.9949 1 32004 12634
.<>is<>90082 -8.0481 2 65255 12590
.<>the<>90083 -8.0870 12 65255 74521
of<>for<>90084 -8.1316 1 42133 10498
a<>to<>90085 -8.1386 2 27651 32004
.<>had<>90086 -8.1483 1 65255 6773
The<>.<>90087 -8.3537 1 7802 65268
the<>s<>90088 -8.3777 1 74521 6911
.<>with<>90089 -8.5395 1 65255 8870
.<>his<>90090 -8.5531 1 65255 8954
.<>a<>90091 -8.6123 3 65255 27651
The<>,<>90092 -8.6207 1 7802 77904
and<>.<>90093 -8.9356 3 34436 65268
the<>is<>90094 -9.2484 1 74521 12590
to<>to<>90095 -9.3536 1 32004 32004
a<>the<>90096 -9.3976 2 27651 74521
the<>the<>90097 -10.2880 3 74521 74521
the<>a<>90098 -10.3977 1 74521 27651
As you see, ", 000" has rank one, which means those tokens are highly dependent. Likewise, "t" preceded by "shouldn" and "aren" shows that "t" will occur with these words.
The last part of the same output shows that data like "the a" and "of he" are not at all dependent, and we can conclude that negative values indicate independence
between the words in the bigram. Thus there is a balance point between the dependent and independent values.
TOP 50 COMPARISONS
between user2.pm and tmi.pm
user2.pm :
1629547
,<>000<>1 29.3663 446 77904 446
Sherlock<>Holmes<>2 28.2934 202 202 963
couldn<>t<>3 28.0851 175 175 2448
wasn<>t<>4 27.8913 153 153 2448
wouldn<>t<>5 27.7528 139 139 2448
Hor<>.<>6 27.3715 111 111 65268
hadn<>t<>7 27.3343 104 104 2448
isn<>t<>8 27.2037 95 95 2448
Laer<>.<>9 26.5313 62 62 65268
Project<>Gutenberg<>10 26.4689 57 86 57
Oph<>.<>11 26.4350 58 58 65268
Clown<>.<>12 26.1006 46 46 65268
ain<>t<>13 26.0262 42 42 2448
Ay<>,<>14 25.7353 36 36 77904
tmi.pm :
1629547
.<>The<>1 19721.9494 5593 65255 7800
of<>the<>2 16166.1332 10710 42133 74521
,<>and<>3 12077.0140 8584 77904 34436
.<>He<>4 11447.6870 2986 65255 3671
in<>the<>5 10115.0110 6403 23448 74521
.<>It<>6 9809.9502 2555 65255 3130
,<>but<>7 6486.1348 2362 77904 4111
.<>I<>8 5816.9501 3392 65255 12556
.<>But<>9 5762.9153 1506 65255 1846
to<>be<>10 5444.1418 2067 32004 7847
.<>In<>11 5293.2955 1458 65255 1918
tmi.pm gives us the true mutual information between the two words, while user2 gives us an idea about the dependence and independence of the two words in the bigram.
If one looks at the user2.pm output, the higher ranks suggest that even though these words have occurred very few times individually (as in "wasn t"), the chance of the first word occurring in that place
given that the second word occurred there is high; e.g. the occurrence of "wasn" given that of "t" is high. That is, it ranks the bigrams on the occurrence of the
first word given the occurrence of the bigram, which can be an interesting measure for "prediction" too.
In the case of true mutual information, the stress is more on the individual words occurring together given the chances of their own occurrence, and thus
it gives us complete information about the two words: occurring together, not occurring together, and one occurring without the other. So on the higher side, words like "the", ".", "he",
which occur very often in the text and thus have higher probability, make the higher ranks, while others rank lower because of their
individual and joint frequencies. This can be a good measure for deciding the probability of common words in a text and their typical associations,
e.g. the association of "of" with "the" or of "to" with "be".
Thus the two measures look at the text differently.
CUTOFF POINT :
In the case of user2, the cutoff will be anything below zero, after which the bigrams no longer make sense, as they cannot occur like that in text.
RANK COMPARISON :
Rank correlation values:
tmi and mi : 0.8086
user2 and mi : 0.9669
ll and user2 : 0.8080
mi and user2 : 0.9669
This shows that user2 computes results most similar to mi.
Overall Recommendation :
The measures of association used give different interpretations of the given corpus, and so each has its own standing.
References :
Books :
Foundations of statistical natural language processing : Manning and Schutze
Introduction to probability and mathematical statistics : Bain and Engelhardt
Nonparametric Statistical Inference : J.Gibbons and S. Chakraborti
Multiway Contingency Tables Analysis for the Social Sciences : Thomas Wickens.
urls :
http://home.clara.net/sisa/two2hlp.htm
http://www.mathpsyc.uni-bonn.de/doc/cristant/node15.html
http://www.ci.tuwien.ac.at/~zeileis/teaching/slides4.pdf
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
NAME: SUMALATHA KUTHADI
CLASS: NATURAL LANGUAGE PROCESSING
DATE: 10 / 11 / 02
CS8761 : ASSIGNMENT 2
> TO INVESTIGATE VARIOUS MEASURES OF ASSOCIATION THAT CAN BE
USED TO IDENTIFY COLLOCATIONS IN LARGE CORPORA OF TEXT.
INTRODUCTION:
> COLLOCATIONS: A COLLOCATION IS AN EXPRESSION THAT CONSISTS OF
TWO OR MORE WORDS THAT CORRESPOND TO SOME CONVENTIONAL WAY OF
SAYING SOMETHING.
> TEST OF ASSOCIATION: IT IS A TEST WHICH IS CARRIED OUT TO FIND
THE "ASSOCIATION AMONG THE WORDS" IN A COLLOCATION.
> "ASSOCIATION" AMONG THE WORDS IS ALSO CALLED "MUTUAL INFORMATION".
> "MUTUAL INFORMATION" MEASURES THE DEGREE OF DEPENDENCE BETWEEN THE
WORDS IN A COLLOCATION. IT IS THE AMOUNT OF INFORMATION ONE RANDOM
VARIABLE CONTAINS ABOUT ANOTHER RANDOM VARIABLE.
> WHEN TEST OF ASSOCIATION IS PERFORMED ON A BIGRAM, IT ASSIGNS A
SCORE AND A RANK TO THE COLLOCATION. RANK GIVEN IS BASED ON THE SCORE.
EXPERIMENTS DONE IN THE ASSIGNMENT:
> tmi.pm: THIS TEST FINDS THE TRUE MUTUAL INFORMATION AMONG THE WORDS IN
THE COLLOCATION AND ASSIGNS SCORES TO THE THEM. RANKS ARE BASED ON THE
SCORES GIVEN TO BIGRAMS.
FORMULA USED:
I(X;Y) = SUMMATION( P(X,Y) * log( P(X,Y) / (P(X)*P(Y)) ) )
I      -> TRUE MUTUAL INFORMATION
P(X,Y) -> JOINT PROBABILITY OF THE WORDS IN THE BIGRAM
P(X)   -> PROBABILITY OF FIRST WORD IN BIGRAM
P(Y)   -> PROBABILITY OF SECOND WORD IN BIGRAM
> user2.pm: THE TEST USED IN THIS PROGRAM IS utest. THIS PROGRAM FINDS SIGNIFICANT
COLLOCATIONS. IT ALSO ASSIGNS RANKS TO THE COLLOCATIONS. THESE RANKS
ARE BASED ON THE SCORES ASSIGNED TO THE COLLOCATIONS.
SCORES AND RANKS ARE INVERSELY PROPORTIONAL TO EACH OTHER.
FORMULAE USED:
u = (jointprob - indprob) / sqrt( indprob * (1 - indprob) / npp );
jointprob = p(w1,w2) = n(w1,w2) / npp;
indprob = p(w1)*p(w2) = (n(w1) * n(w2)) / (npp * npp);
n(w1)    -> number of times w1 occurs as 1st word in a bigram.
n(w2)    -> number of times w2 occurs as 2nd word in a bigram.
n(w1,w2) -> number of times w1 & w2 occur as 1st & 2nd words respectively in the bigram.
npp      -> total number of bigrams.
REFERENCE: http://icl.pku.edu.cn/yujs/papers/pdf/htest.pdf
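Putting the three formulae above together, the u score can be sketched as follows. This is a direct transcription of the definitions above into Python for illustration (the variable names mirror the formulae; it is not the user2.pm source):

```python
from math import sqrt

def u_score(n_w1w2, n_w1, n_w2, npp):
    """u score of a bigram from the formulae above.

    n_w1w2 - times w1 and w2 occur together as a bigram
    n_w1   - times w1 occurs as the 1st word of a bigram
    n_w2   - times w2 occurs as the 2nd word of a bigram
    npp    - total number of bigrams
    """
    jointprob = n_w1w2 / npp
    indprob = (n_w1 * n_w2) / (npp * npp)
    return (jointprob - indprob) / sqrt(indprob * (1 - indprob) / npp)

# A pair that always occurs together scores positive; a pair occurring
# far less often than its word frequencies predict scores negative:
print(u_score(10, 10, 10, 10000) > 0)
print(u_score(10, 1000, 1000, 10000) < 0)
```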
> user3.pm : THIS PROGRAM IS AN EXTENSION OF user2.pm. THIS PROGRAM FINDS
TRIGRAMS. ALSO, ASSIGNS SCORES AND RANKS TO TRIGRAMS.
FORMULAE USED:
u = (jointprob - indprob) / sqrt( indprob * (1 - indprob) / nppp );
jointprob = p(w1,w2,w3) = n(w1,w2,w3) / nppp;
indprob = p(w1)*p(w2)*p(w3) = (n(w1) * n(w2) * n(w3)) / (nppp * nppp * nppp);
n(w1)       -> number of times w1 occurs as 1st word in a trigram.
n(w2)       -> number of times w2 occurs as 2nd word in a trigram.
n(w3)       -> number of times w3 occurs as 3rd word in a trigram.
n(w1,w2,w3) -> number of times w1, w2 and w3 occur as 1st, 2nd and 3rd
words respectively in the trigram.
nppp        -> total number of trigrams.
> ll3.pm: "LOGLIKELIHOOD TEST" IS USED IN THIS PROGRAM. THIS PROGRAM FINDS
SIGNIFICANT COLLOCATIONS . THIS PROGRAM ALSO ASSIGNS SCORES AND RANKS TO
COLLOCATIONS.
FORMULAE USED:
g2 = 2 * summation( nijk * log( nijk / eijk ) );
nijk -> observed frequency of trigram ijk.
eijk -> expected frequency of trigram ijk.
eijk = nipp * npjp * nppk / (nppp * nppp).
nppp -> total number of trigrams.
EXPERIMENT 1:
TOP 50 COMPARISON:
>FROM THE TOP 50 RANKS PRODUCED BY user2.pm AND tmi.pm, I OBSERVE THAT tmi.pm
ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS, WHICH OCCUR FREQUENTLY. user2.pm ASSIGNS
HIGH RANKS TO THOSE COLLOCATIONS, IN WHICH 1ST WORD OCCURS ONLY WITH 2ND WORD OR A
FEW OTHER WORDS, AND VICE VERSA.
>FOR EXAMPLE: TOP6 RANKS GIVEN BY user2.pm AND tmi.pm.
>OUTPUT OF user2.pm:
BINI<>SALFININISTAS<>1 1155.9191 1 1 1
Multiphastic<>Personality<>1 1155.9191 1 1 1
Theodor<>Reik<>1 1155.9191 1 1 1
PRINCE<>WEARS<>1 1155.9191 1 1 1
Birdie<>Gevurtz<>1 1155.9191 1 1 1
Manon<>Lescaut<>1 1155.9191 1 1 1
> OUTPUT OF tmi.pm:
.<>The<>1 0.0135 4961 49673 6921
of<>the<>2 0.0102 9181 36043 62690
.<>He<>3 0.0071 2421 49673 2991
in<>the<>4 0.0062 5330 19581 62690
,<>and<>5 0.0049 5530 58974 27952
.<>It<>5 0.0049 1718 49673 2184
>I THINK user2.pm IS BETTER THAN tmi.pm AT IDENTIFYING SIGNIFICANT COLLOCATIONS
BECAUSE user2.pm IS GOOD AT IDENTIFYING COLLOCATIONS WHICH HAVE SOME MEANING AND
ASSIGNS THEM HIGH RANKS. tmi.pm CONSIDERS COLLOCATIONS, WHICH OCCUR FREQUENTLY AS
BEST COLLOCATIONS. WORDS LIKE I, HE, HAD, THE, A ETC AND PUNCTUATIONS
OCCUR FREQUENTLY IN ANY TEXT. THESE WORDS AND PUNCTUATIONS ARE ASSOCIATED WITH MANY
OTHER WORDS. SO,COLLOCATIONS WHICH CONTAIN THESE WORDS AND PUNCTUATIONS ARE CONSIDERED
AS BEST COLLOCATIONS BY tmi.pm.
>IN THE TEXT I USED, "I HAD" OCCURRED 810 TIMES AND "NOMINALLY ESTIMATED" OCCURRED
ONLY 1 TIME.
ACCORDING TO user2.pm, "NOMINALLY ESTIMATED" IS THE BEST COLLOCATION, WHEREAS ACCORDING
TO tmi.pm, "I HAD" IS A BEST COLLOCATION.
CUTOFF POINT:
>CUTOFF POINT FOR user2.pm:
I THINK THE OUTPUT OF user2.pm HAS A NATURAL CUTOFF POINT FOR SCORES AT 1155.9191.
PART OF THE OUTPUT: IT CONTAINS SCORE CUTOFF.
CONTINUUM<>MECHANICS<>1 1155.9191 1 1 1
Stabat<>Mater<>1 1155.9191 1 1 1
Chou<>En<>2 1155.9182 2 2 2
Nogay<>Tartary<>2 1155.9182 2 2 2
Pee<>Wee<>2 1155.9182 2 2 2
>"I OBSERVED THAT ABOVE THE CUTOFF POINT, ALL THE COLLOCATIONS HAVE WORDS THAT DO NOT
APPEAR IN ANY OTHER COLLOCATIONS."
>CUTOFF POINT OF tmi.pm:
I THINK THE OUTPUT OF tmi.pm HAS A NATURAL CUT OFF POINT AT 0.0001 BECAUSE AFTER THAT
POINT SCORE OF ALL COLLOCATIONS IS ZERO.
FROM THE OUTPUT TABLE, I OBSERVED THAT ALL FREQUENCY COMBINATIONS OF ALL BIGRAMS ARE GREATER
THAN ZERO.
PART OF THE OUTPUT: IT CONTAINS SCORE CUTOFF.
contact<>with<>39 0.0001 30 59 7007
returned<>to<>39 0.0001 52 115 25725
wavers<>and<>40 0.0000 1 2 27952
agglutination<>of<>40 0.0000 1 4 36043
effective<>they<>40 0.0000 2 127 2855
>"I THINK 0.0001 IS THE CUTOFF POINT BECAUSE AFTER THAT POINT ALL COLLOCATIONS HAVE THE SAME RANK.
THE SCORE VALUES MIGHT BE ROUNDED TO 4-DIGIT PRECISION, BUT THE ACTUAL SCORES OF ALL THESE BIGRAMS MIGHT
BE ALMOST EQUAL TO EACH OTHER. I THINK BELOW THE CUTOFF POINT ALL BIGRAMS ARE EQUALLY SIGNIFICANT.
RANK COMPARISON:
*>COMPARING user2.pm AND tmi.pm WITH mi.pm:
RANK CORRELATION COEFFICIENT OF user2.pm WITH mi.pm: 0.9523
RANK CORRELATION COEFFICIENT OF tmi.pm WITH mi.pm: 0.9528
>BASED ON "RANK CORRELATION COEFFICIENT" AND "RANKINGS" GIVEN BY THE BOTH TESTS,
I THINK THAT user2.pm IS MORE LIKE mi.pm.
>RANKS GIVEN BY tmi.pm AND mi.pm ARE TOTALLY DIFFERENT.
*>COMPARING user2.pm AND tmi.pm WITH ll.pm:
RANK CORRELATION COEFFICIENT OF user2.pm WITH ll.pm: 0.8870
RANK CORRELATION COEFFICIENT OF tmi.pm WITH ll.pm: 0.9527
>BASED ON "RANK CORRELATION COEFFICIENT" I THINK THAT user2.pm IS MORE LIKE ll.pm.
OVERALL RECOMMENDATION:
> I THINK user2.pm IS BETTER THAN ll.pm, tmi.pm AND mi.pm.
tmi.pm AND ll.pm CONSIDER FREQUENTLY OCCURING COLLOCATIONS AS SIGNIFICANT
COLLOCATIONS. SO THEY CONSIDER COLLOCATIONS WHICH HAVE PUNCTUATIONS AND WORDS LIKE THE,
HE, IT ETC AS SIGNIFICANT COLLOCATIONS.
> BETWEEN user2.pm AND mi.pm, I THINK user2.pm IS BETTER, BECAUSE user2.pm CONSIDERS
COLLOCATIONS CONTAINING WORDS THAT OCCUR ONLY TOGETHER, AND NOT WITH ANY OTHER WORDS,
AS SIGNIFICANT COLLOCATIONS. THESE SIGNIFICANT COLLOCATIONS WILL NOT HAVE PUNCTUATION
OR WORDS LIKE THE, HE, IT ETC.
EXPERIMENT 2:
TOP 50 RANK:
>RANK CORRELATION COEFFICIENT: 0.7953
>FROM THE TOP 50 RANKS PRODUCED BY user3.pm AND ll3.pm, I OBSERVE THAT ll3.pm
ASSIGNS HIGH RANKS TO THOSE COLLOCATIONS WHICH OCCUR FREQUENTLY. user3.pm ASSIGNS
HIGH RANKS TO THOSE COLLOCATIONS IN WHICH 1ST WORD OCCURS ONLY WITH 2ND WORD OR A
FEW OTHER WORDS AND VICEVERSA.
>FOR EXAMPLE: TOP 6 RANKS GIVEN BY user3.pm AND ll3.pm.
>OUTPUT OF ll3.pm:
,<>and<>,<>1 2923.7663 1 304 149 304 76 14 3
,<>and<>the<>2 2400.4036 6 304 149 172 76 23 13
,<>and<>I<>3 2164.3900 3 304 149 96 76 14 9
,<>and<>my<>4 2096.5959 1 304 149 76 76 4 6
,<>and<>was<>5 2031.9212 3 304 149 55 76 15 3
,<>and<>that<>6 2024.7081 4 304 149 64 76 7 9
>OUTPUT OF user3.pm:
true<>repenting<>prodigal<>1 3870.9997 1 1 1 1 1 1 1
wer<>n<>t<>1 3870.9997 1 1 1 1 1 1 1
Low<>Country<>wars<>1 3870.9997 1 1 1 1 1 1 1
secret<>burning<>lust<>1 3870.9997 1 1 1 1 1 1 1
eighteen<>years<>old<>1 3870.9997 1 1 1 1 1 1 1
battle<>near<>Dunkirk<>2 2737.2100 1 1 2 1 1 1 1
>I THINK user3.pm IS BETTER THAN ll3.pm AT IDENTIFYING SIGNIFICANT COLLOCATIONS
BECAUSE user3.pm IS GOOD AT IDENTIFYING COLLOCATIONS WHICH HAVE SOME MEANING AND
ASSIGNS THEM HIGH RANKS. ll3.pm CONSIDERS COLLOCATIONS WHICH OCCUR FREQUENTLY AS
BEST COLLOCATIONS. WORDS LIKE I, HE, HAD, THE, A ETC AND PUNCTUATION
OCCUR FREQUENTLY IN ANY TEXT. THESE WORDS AND PUNCTUATION MARKS ARE ASSOCIATED WITH MANY
OTHER WORDS. SO COLLOCATIONS WHICH CONTAIN THESE WORDS AND PUNCTUATION ARE CONSIDERED
BEST COLLOCATIONS BY ll3.pm.
>IN THE TEXT I USED, "TRUE REPENTING PRODIGAL" OCCURRED 1 TIME AND ", and ,"
OCCURRED 76 TIMES.
ACCORDING TO user3.pm, "TRUE REPENTING PRODIGAL" IS THE BEST COLLOCATION, WHEREAS ACCORDING
TO ll3.pm, ", and ," IS A BEST COLLOCATION.
CUTOFF POINT:
>CUTOFF POINT FOR SCORES OF user3.pm:
CUTOFF POINT: 3870.9997
ABOVE THIS POINT ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ONLY WITH THAT COLLOCATION.
BELOW THIS POINT ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ALSO IN OTHER
COLLOCATIONS.
>CUTOFF POINT FOR SCORES OF ll3.pm:
CUTOFF POINT: 92244.4599
ABOVE THIS POINT, ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ONLY WITH THAT COLLOCATION.
BELOW THIS POINT, ALL THE COLLOCATIONS HAVE WORDS THAT OCCUR ALSO IN OTHER
COLLOCATIONS
OVERALL RECOMMENDATION:
I THINK user3.pm IS BETTER THAN ll3.pm. AS I SAID ABOVE, user3.pm CONSIDERS COLLOCATIONS
THAT HAVE SOME MEANING AS SIGNIFICANT COLLOCATIONS.
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
# *********************************************************************************
# experiments.txt Report for Assignment #2 Ngram Statistics
# Name: Yanhua Li
# Class: CS 8761
# Assignment #2: Oct. 11, 2002
# *********************************************************************************
This experiment compares several measures of association in large corpora of text.
I use the text from http://ibiblio.unc.edu/pub/docs/books/gutenberg/etext02/cwces10.txt
Commands to use:
First run
count.pl cwces10.cnt cwces10.txt
count.pl --ngram 3 cwces10_3.cnt cwces10.txt
to get bigrams, trigrams and their frequencies.
Then run
statistic.pl mi cwces10.mi cwces10.cnt
statistic.pl tmi cwces10.tmi cwces10.cnt
statistic.pl user2 cwces10.user2 cwces10.cnt
statistic.pl ll cwces10.ll cwces10.cnt
statistic.pl --ngram 3 ll3 test.ll3 cwces10_3.cnt
statistic.pl --ngram 3 user3 cwces10.user3 cwces10_3.cnt
%%%%%%%%%%%%%%
EXPERIMENT 1
1.1 TOP 50 COMPARISON
1.1.1 user2.pm
Using user2.pm (namely, the phi coefficient measure) to run NSP.
Part of the results:
*********************************************
136218
SEND<>MONEY<>1 1.0000 1 1 1
loftiest<>pulpit<>1 1.0000 1 1 1
revised<>rates<>1 1.0000 1 1 1
irreclaimable<>givers<>1 1.0000 1 1 1
Petite<>Provence<>1 1.0000 1 1 1
HIS<>WIFE<>1 1.0000 3 3 3
NEW<>FEMININE<>1 1.0000 3 3 3
Revenue<>Service<>1 1.0000 1 1 1
CLEARING<>HOUSE<>1 1.0000 3 3 3
knicker<>bockers<>1 1.0000 1 1 1
palatial<>apartment<>1 1.0000 1 1 1
PREVAILING<>DISCONTENT<>1 1.0000 2 2 2
25<>Short<>1 1.0000 1 1 1
Copper<>Cylinder<>1 1.0000 1 1 1
ablution<>bowls<>1 1.0000 1 1 1
COMMON<>SCHOOL<>1 1.0000 1 1 1
THOUGHTS<>SUGGESTED<>1 1.0000 1 1 1
Treasury<>sends<>1 1.0000 1 1 1
NATHAN<>HALE<>1 1.0000 3 3 3
railroad<>fare<>1 1.0000 1 1 1
boats<>glide<>1 1.0000 1 1 1
angular<>corners<>1 1.0000 1 1 1
Uncle<>Sam<>1 1.0000 2 2 2
DON<>T<>1 1.0000 1 1 1
ENGLISH<>VOLUNTEERS<>1 1.0000 3 3 3
SOME<>CAUSES<>1 1.0000 2 2 2
brevity<>recommends<>1 1.0000 1 1 1
promotes<>trading<>1 1.0000 1 1 1
300<>billion<>1 1.0000 1 1 1
Small<>Print<>1 1.0000 6 6 6
.
.
. (in total, 330 bigrams rank No. 1)
o<>clock<>2 0.9354 7 8 7
LEISURE<>CLASS<>3 0.8660 3 3 4
WITH<>AN<>3 0.8660 3 4 3
MS<>38655<>3 0.8660 3 4 3
hart<>pobox<>3 0.8660 3 4 3
AN<>EGO<>3 0.8660 3 3 4
Long<>Island<>4 0.8165 2 3 2
THIS<>ETEXT<>4 0.8165 2 2 3
EVEN<>IF<>4 0.8165 2 2 3
Private<>Theatricals<>4 0.8165 2 3 2
GET<>GUTINDEX<>4 0.8165 2 2 3
Meredith<>understands<>4 0.8165 2 2 3
DOMAIN<>ETEXTS<>4 0.8165 2 2 3
BUT<>NOT<>4 0.8165 2 2 3
Project<>Gutenberg<>4 0.8165 18 27 18
seventh<>son<>4 0.8165 2 3 2
Lydia<>Blood<>4 0.8165 2 3 2
Sandwich<>Islands<>4 0.8165 2 2 3
United<>States<>5 0.8086 34 34 52
.
.
.
to<>and<>1533 0.0279 1 3528 3960
to<>.<>1534 0.0286 2 3528 4219
.<>and<>1535 0.0304 2 4219 3960
.<>.<>1536 0.0307 5 4219 4219
to<>of<>1537 0.0337 1 3528 5668
in<>,<>1538 0.0349 9 2828 8193
of<>and<>1539 0.0358 1 5668 3960
,<>of<>1540 0.0365 105 8193 5668
of<>.<>1541 0.0366 3 5668 4219
.<>of<>1542 0.0370 1 4219 5668
a<>,<>1543 0.0376 1 2975 8193
and<>,<>1544 0.0388 27 3960 8193
to<>,<>1545 0.0401 6 3528 8193
.<>,<>1546 0.0447 3 4219 8193
of<>,<>1547 0.0505 14 5668 8193
the<>,<>1548 0.0639 1 8202 8193 (the end)
****************************************
A phi coefficient of 1 means the bigram is highly dependent. We can see that the
last three numbers in each row of the No. 1 ranked bigrams are identical, which
means the two words of the bigram always appear together, so we consider them
dependent. Actually this measure is not very good here, since most of the No. 1
ranked bigrams appear only once in this large text and some of them are rarely
used words, so we can't simply say they are dependent. The top 50 all suggest
the two words of the bigram have some kind of high dependence. The bottom ranks
mean the two words are independent, and the phi coefficient method measures this
well because those words appear together only a few times but appear
individually very often.
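The 2x2 computation behind these phi scores can be sketched as follows. This is
a minimal Python illustration, not the user2.pm module itself; the function name
is mine, and the example counts are taken from the listing above:

```python
import math

def phi_coefficient(n_xy, n_x, n_y, n_total):
    # Build the 2x2 contingency table from the bigram count n_xy,
    # the marginal word counts n_x and n_y, and the total bigram count.
    a = n_xy                  # x followed by y
    b = n_x - n_xy            # x followed by something else
    c = n_y - n_xy            # something else followed by y
    d = n_total - a - b - c   # neither word in its position
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# SEND<>MONEY: counts 1 1 1 out of 136218 bigrams
print(round(phi_coefficient(1, 1, 1, 136218), 4))   # 1.0
# o<>clock: counts 7 8 7
print(round(phi_coefficient(7, 8, 7, 136218), 4))   # 0.9354, as in the listing
```

When n_xy = n_x = n_y, the b and c cells are both 0 and the score is exactly 1,
which is why every bigram whose words occur nowhere else ranks No. 1.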
1.1.2 tmi.pm
Using tmi.pm (namely, "true" mutual information method) to run NSP
Part of results
**************************************
136218
,<>and<>1 0.0302 1533 8193 3960
of<>the<>2 0.0207 1392 5668 8202
.<>The<>3 0.0202 573 4219 659
.<>It<>4 0.0170 472 4219 508
in<>the<>5 0.0127 783 2828 8202
it<>is<>6 0.0112 414 1667 2624
It<>is<>7 0.0103 288 508 2624
,<>but<>8 0.0092 352 8193 501
to<>be<>9 0.0087 341 3528 1183
.<>We<>10 0.0081 229 4219 260
.<>But<>11 0.0072 198 4219 203
.<>And<>12 0.0060 169 4219 193
is<>not<>13 0.0056 224 2624 1076
.<>In<>14 0.0051 143 4219 163
would<>be<>15 0.0050 131 405 1183
to<>the<>15 0.0050 523 3528 8202
,<>or<>16 0.0049 262 8193 739
.<>This<>17 0.0048 135 4219 150
.<>There<>18 0.0047 133 4219 149
may<>be<>19 0.0046 114 301 1183
has<>been<>20 0.0045 93 518 263
.<>If<>21 0.0044 125 4219 143
the<>world<>22 0.0043 169 8202 253
have<>been<>23 0.0042 94 697 263
there<>is<>23 0.0042 129 318 2624
does<>not<>24 0.0041 86 118 1076
.<>He<>25 0.0037 104 4219 112
do<>not<>26 0.0036 87 229 1076
is<>a<>27 0.0035 234 2624 2975
;<>but<>28 0.0034 89 660 501
for<>the<>28 0.0034 231 939 8202
they<>are<>29 0.0033 90 515 766
we<>have<>30 0.0032 91 652 697
on<>the<>31 0.0031 172 515 8202
the<>most<>32 0.0030 126 8202 229
should<>be<>32 0.0030 75 198 1183
must<>be<>32 0.0030 70 148 1183
I<>am<>33 0.0029 46 389 46
in<>a<>33 0.0029 216 2828 2975
will<>be<>33 0.0029 82 337 1183
by<>the<>34 0.0028 184 736 8202
United<>States<>34 0.0028 34 34 52
. (110 rows omitted)
.
.
New<>England<>50 0.0011 16 62 56
as<>well<>50 0.0011 30 976 114
,<>with<>50 0.0011 107 8193 682
there<>are<>50 0.0011 34 318 766
no<>more<>50 0.0011 30 328 409
.<>Some<>50 0.0011 30 4219 32
it<>would<>50 0.0011 46 1667 405
it<>may<>50 0.0011 42 1667 301
,<>he<>50 0.0011 98 8193 562
.
.
.
it<>to<>63 0.0002 18 1667 3528
that<>of<>64 0.0003 41 1912 5668
,<>is<>64 0.0003 126 8193 2624
in<>,<>64 0.0003 9 2828 8193
is<>.<>64 0.0003 15 2624 4219
not<>,<>64 0.0003 25 1076 8193
,<>a<>64 0.0003 148 8193 2975
of<>that<>64 0.0003 36 5668 1912
and<>to<>64 0.0003 67 3960 3528
and<>is<>64 0.0003 16 3960 2624
be<>,<>64 0.0003 20 1183 8193
,<>to<>65 0.0004 174 8193 3528
is<>of<>65 0.0004 19 2624 5668
that<>,<>65 0.0004 30 1912 8193
is<>,<>66 0.0005 94 2624 8193
of<>,<>66 0.0005 14 5668 8193
and<>of<>67 0.0006 52 3960 5668
and<>,<>67 0.0006 27 3960 8193
,<>of<>68 0.0013 105 8193 5668 (the end)
**************************************
From the result, we can see the tmi method is much better than the phi
coefficient method at identifying significant or interesting collocations. The
top 50 show very high dependence, and that is indeed the case. Its independence
analysis is not as good as the phi coefficient method's, as we can see from the
last three numbers of the bottom-row bigrams. Therefore tmi is better at
determining dependence, but the phi coefficient is better at determining
independence.
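The four-cell "true" mutual information score discussed above can be sketched in
Python as below. This is an illustration of the formula, not the tmi.pm module;
the function name is mine, and an implementation that keeps only the
observed-bigram cell will produce somewhat different scores:

```python
import math

def true_mi(n_xy, n_x, n_y, n_total):
    # Sum P(i,j) * log2( P(i,j) / (P(i)*P(j)) ) over all four cells of the
    # 2x2 table implied by the bigram count and the two marginal counts.
    cells = [
        (n_xy,                       n_x,           n_y),
        (n_x - n_xy,                 n_x,           n_total - n_y),
        (n_y - n_xy,                 n_total - n_x, n_y),
        (n_total - n_x - n_y + n_xy, n_total - n_x, n_total - n_y),
    ]
    tmi = 0.0
    for observed, row, col in cells:
        expected = row * col / n_total   # expected cell count under H0
        if observed > 0 and expected > 0:
            tmi += (observed / n_total) * math.log2(observed / expected)
    return tmi

# A perfectly independent table scores 0 ...
print(true_mi(1, 100, 100, 10000))                        # 0.0
# ... while an associated pair such as United<>States scores higher.
print(true_mi(34, 34, 52, 136218) > 0)                    # True
```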
1.2 CUTOFF POINT
1.2.1 user2.pm
Part of results
******************************************
136218
SEND<>MONEY<>1 1.0000 1 1 1
loftiest<>pulpit<>1 1.0000 1 1 1
revised<>rates<>1 1.0000 1 1 1
irreclaimable<>givers<>1 1.0000 1 1 1
Petite<>Provence<>1 1.0000 1 1 1
HIS<>WIFE<>1 1.0000 3 3 3
NEW<>FEMININE<>1 1.0000 3 3 3
Revenue<>Service<>1 1.0000 1 1 1
CLEARING<>HOUSE<>1 1.0000 3 3 3
knicker<>bockers<>1 1.0000 1 1 1
.
.
.
(possible cut off point)
is<>because<>1354 0.0005 2 2624 92
public<>as<>1354 0.0005 1 115 976
,<>determined<>1354 0.0005 1 8193 14
,<>trained<>1354 0.0005 1 8193 14
and<>equal<>1354 0.0005 1 3960 29
book<>a<>1354 0.0005 1 38 2975
the<>large<>1354 0.0005 2 8202 29
world<>to<>1354 0.0005 7 253 3528
house<>a<>1354 0.0005 1 38 2975
none<>the<>1354 0.0005 1 14 8202
to<>foreign<>1354 0.0005 1 3528 32
,<>current<>1354 0.0005 1 8193 14
case<>.<>1354 0.0005 1 27 4219
them<>an<>1354 0.0005 1 222 507
of<>brilliant<>1354 0.0005 1 5668 20
early<>,<>1354 0.0005 1 14 8193
Project<>.<>1354 0.0005 1 27 4219
that<>money<>1355 0.0004 1 1912 62
her<>an<>1355 0.0004 1 229 507
.
.
.
.<>.<>1536 0.0307 5 4219 4219
to<>of<>1537 0.0337 1 3528 5668
in<>,<>1538 0.0349 9 2828 8193
of<>and<>1539 0.0358 1 5668 3960
,<>of<>1540 0.0365 105 8193 5668
of<>.<>1541 0.0366 3 5668 4219
.<>of<>1542 0.0370 1 4219 5668
a<>,<>1543 0.0376 1 2975 8193
and<>,<>1544 0.0388 27 3960 8193
to<>,<>1545 0.0401 6 3528 8193
.<>,<>1546 0.0447 3 4219 8193
of<>,<>1547 0.0505 14 5668 8193
the<>,<>1548 0.0639 1 8202 8193 (the end)
**************************************
There is a possible cutoff point. Before it, the two words of most bigrams
appear rarely in this text or in English, but after it, the bigrams mostly
consist of familiar words. This means the phi coefficient method is good at
distinguishing individual high-frequency words from low-frequency ones.
1.2.2 tmi.pm
Part of results
**************************************
136218
,<>and<>1 0.0302 1533 8193 3960
of<>the<>2 0.0207 1392 5668 8202
.<>The<>3 0.0202 573 4219 659
.<>It<>4 0.0170 472 4219 508
in<>the<>5 0.0127 783 2828 8202
it<>is<>6 0.0112 414 1667 2624
It<>is<>7 0.0103 288 508 2624
,<>but<>8 0.0092 352 8193 501
.
.
.
much<>depended<>60 0.0001 1 192 4
reservoir<>of<>60 0.0001 2 2 5668
expensive<>mistake<>60 0.0001 1 6 23
the<>bench<>60 0.0001 3 8202 4
had<>crowded<>60 0.0001 1 236 5
s<>actual<>60 0.0001 1 119 10
it<>certainly<>60 0.0001 3 1667 21
two<>million<>60 0.0001 1 60 7
(possible cutoff point1)
born<>into<>60 0.0001 2 37 250
popular<>poet<>60 0.0001 1 48 17
irreclaimable<>givers<>60 0.0001 1 1 1
facades<>;<>60 0.0001 1 1 660
the<>spirits<>60 0.0001 3 8202 3
Petite<>Provence<>60 0.0001 1 1 1
consistent<>ourselves<>60 0.0001 1 5 23
.
.
.
artist<>only<>61 0.0000 1 9 272
that<>flower<>61 0.0000 1 1912 12
or<>dignity<>61 0.0000 1 739 11
reports<>,<>61 0.0000 3 15 8193
soon<>have<>61 0.0000 1 18 697
world<>if<>61 0.0000 1 253 354
fruit<>,<>61 0.0000 2 6 8193
nature<>into<>61 0.0000 1 118 250
,<>harmony<>61 0.0000 1 8193 11
(possible cutoff point2)
,<>not<>62 0.0001 56 8193 1076
as<>of<>62 0.0001 10 976 5668
life<>the<>62 0.0001 5 360 8202
.<>of<>62 0.0001 1 4219 5668
for<>that<>62 0.0001 5 939 1912
is<>for<>62 0.0001 9 2624 939
.<>,<>62 0.0001 3 4219 8193
for<>,<>62 0.0001 6 939 8193
has<>,<>62 0.0001 3 518 8193
if<>,<>62 0.0001 2 354 8193 (the end)
************************************
Possible cutoff point 1
Before it, bigrams show some sort of strong dependence. After it, there are
some rarely occurring words that do not show much dependence.
Possible cutoff point 2
Before it, one word of the bigram appears often but the other appears much less
often. After it, both words are high-frequency words, and they are largely
independent.
This tells us that the phi coefficient method is better at distinguishing
frequently used words from seldom-used ones, while the tmi method places more
emphasis on dependence versus independence.
1.3 RANK COMPARISON
Run rank.pl cwces10.ll cwces10.tmi
and we get a rank correlation coefficient of 0.4412.
This means the log likelihood and tmi methods agree to some degree, although
the match is not perfect.
Run rank.pl cwces10.ll cwces10.user2
and we get a rank correlation coefficient of 0.9182.
This means the phi coefficient method matches the log likelihood method very well.
Run rank.pl cwces10.mi cwces10.tmi
and we get a rank correlation coefficient of 0.3310.
This means the mi and tmi methods agree to some degree, but not much.
Run rank.pl cwces10.mi cwces10.user2
and we get a rank correlation coefficient of 0.9629.
This means the mi method and the phi coefficient method are an almost perfect match!
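The rank correlation coefficient reported by rank.pl can be illustrated with a
small sketch of Spearman's formula. This is my own minimal version, assuming no
tied scores (which real NSP output does contain); the function names are mine:

```python
def rank_positions(scores):
    # Rank 1 goes to the highest score, as in the NSP output files above.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def spearman(scores_a, scores_b):
    # 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank difference per ngram
    n = len(scores_a)
    ra, rb = rank_positions(scores_a), rank_positions(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([0.9, 0.5, 0.1], [3.0, 2.0, 1.0]))   # identical orderings -> 1.0
print(spearman([0.9, 0.5, 0.1], [1.0, 2.0, 3.0]))   # reversed orderings -> -1.0
```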
1.4 OVERALL RECOMMENDATION
The phi coefficient is a good measure of association for 2x2 tables. From the
rank program, we can see that the phi coefficient, mi, and log likelihood
methods match each other closely; all emphasize distinguishing familiar words
from unfamiliar ones, and they should be better at measuring independence than
dependence. The "true mutual information" method places more emphasis on the
dependence/independence issue and is better at detecting dependence. So tmi.pm
is the best of these four for identifying significant collocations in large
corpora of text. We can therefore choose among them according to the results we
want from a test.
%%%%%%%%%%%%%%
EXPERIMENT 2
2.1 TOP 50 COMPARISON
2.1.1 user3.pm
Using user3.pm (namely, phi coefficient method) to run NSP
Part of results
******************************************************
136217
rt<>lsm<>r<>1 369.0745 1 1 1 1 1 1 1
Sauve<>qui<>peut<>1 369.0745 1 1 1 1 1 1 1
Free<>Plain<>Vanilla<>1 369.0745 1 1 1 1 1 1 1
Vanilla<>Electronic<>Texts<>1 369.0745 1 1 1 1 1 1 1
Porte<>du<>Pont<>1 369.0745 1 1 1 1 1 1 1
US<>Internal<>Revenue<>1 369.0745 1 1 1 1 1 1 1
Plain<>Vanilla<>Electronic<>1 369.0745 1 1 1 1 1 1 1
MEDIUM<>IT<>MAY<>1 369.0745 1 1 1 1 1 1 1
demolishing<>Notre<>Dame<>1 369.0745 1 1 1 1 1 1 1
morituri<>te<>salutamus<>1 369.0745 1 1 1 1 1 1 1
maidenly<>screams<>waken<>1 369.0745 1 1 1 1 1 1 1
quid<>pro<>quo<>1 369.0745 1 1 1 1 1 1 1
Revenue<>Service<>IRS<>1 369.0745 1 1 1 1 1 1 1
debased<>Austrian<>coin<>1 369.0745 1 1 1 1 1 1 1
mensa<>et<>thoro<>1 369.0745 1 1 1 1 1 1 1
Internal<>Revenue<>Service<>1 369.0745 1 1 1 1 1 1 1
Parc<>aux<>Cerfs<>1 369.0745 1 1 1 1 1 1 1
25<>Short<>Studies<>1 369.0745 1 1 1 1 1 1 1
WHOM<>SHAKESPEARE<>WROTE<>1 369.0745 1 1 1 1 1 1 1
b<>rt<>lsm<>1 369.0745 1 1 1 1 1 1 1
troublesome<>fly<>sheet<>1 369.0745 1 1 1 1 1 1 1
du<>Pont<>Tournant<>1 369.0745 1 1 1 1 1 1 1
foul<>oaths<>tore<>1 369.0745 1 1 1 1 1 1 1
La<>Petite<>Provence<>1 369.0745 1 1 1 1 1 1 1
UNDER<>STRICT<>LIABILITY<>2 369.0738 1 2 1 1 1 1 1
.
.
.
the<>higher<>would<>7 369.0732 1 8202 68 405 36 33 1
may<>be<>refined<>7 369.0732 1 301 1183 6 114 1 1
inevitable<>remedy<>is<>7 369.0732 1 6 13 2624 1 1 1
other<>introduction<>than<>7 369.0732 1 207 4 334 1 5 1
sound<>.<>In<>7 369.0732 1 12 4219 163 2 1 143
says<>that<>this<>7 369.0732 1 16 1912 743 3 1 37
services<>announced<>,<>7 369.0732 1 4 1 8193 1 1 1
for<>dancing<>,<>7 369.0732 1 939 2 8193 1 106 1
meteors<>were<>expected<>7 369.0732 1 3 236 29 1 1 2
Bench<>,<>and<>7 369.0732 1 3 8193 3960 1 1 1533
an<>incident<>,<>7 369.0732 2 507 7 8193 3 38 2
to<>the<>entrance<>7 369.0732 1 3528 8202 4 523 1 3
a<>charm<>sometimes<>7 369.0732 1 2975 27 33 3 1 1
entirely<>naturalized<>?<>7 369.0732 1 10 6 397 1 1 2
Upon<>the<>theory<>7 369.0732 1 5 8202 22 3 1 7
at<>rest<>,<>7 369.0732 1 377 35 8193 2 42 1
the<>American<>mind<>7 369.0732 1 8202 134 121 62 19 1 (the end)
********************************************************
From the result, we can see the phi coefficient method is not very good at
measuring 3-way tables. It is not sensitive enough, since it is the chi-square
statistic divided by the total number of trigrams, with the square root then
taken, so the phi values are small. From this we can also see that Pearson's
chi-square method is better than the phi coefficient method on 3-way tables,
since its values are much larger and therefore much more sensitive for ranking.
For the phi coefficient method we only ranked down to No. 7. The top ranks are
similar to experiment 1: they all appear only once in the text, and most are
infrequently occurring words, so this does not really mean they are dependent.
The bottom rows show independence very well.
2.1.2 ll3.pm
Using ll3.pm (namely, log likelihood method) to run NSP
Part of results
*********************************************
136217
the<>world<>of<>1 3215299.9030 1 8202 253 5668 169 1988 10
the<>sort<>of<>2 3215537.4462 5 8202 104 5668 8 1988 86
the<>most<>of<>3 3215540.6047 4 8202 229 5668 126 1988 24
the<>public<>of<>4 3215791.9803 1 8202 115 5668 65 1988 1
.<>It<>is<>5 3215837.6590 266 4219 508 2624 472 461 288
the<>best<>of<>6 3215841.6542 3 8202 85 5668 52 1988 4
the<>whole<>of<>7 3215856.6467 1 8202 54 5668 40 1988 1
the<>part<>of<>8 3215863.4397 2 8202 49 5668 3 1988 34
the<>highest<>of<>9 3215865.4608 1 8202 48 5668 37 1988 1
the<>negro<>of<>10 3215879.2826 1 8202 61 5668 39 1988 1
the<>first<>of<>11 3215879.9487 2 8202 66 5668 41 1988 3
the<>South<>of<>12 3215882.2250 1 8202 45 5668 34 1988 1
the<>pursuit<>of<>13 3215902.5578 16 8202 36 5668 20 1988 25
the<>number<>of<>14 3215904.5503 10 8202 33 5668 11 1988 26
the<>end<>of<>15 3215917.5920 15 8202 59 5668 26 1988 26
the<>development<>of<>16 3215922.1584 19 8202 60 5668 21 1988 29
the<>young<>of<>17 3215924.7252 1 8202 111 5668 41 1988 1
the<>sense<>of<>18 3215927.3588 3 8202 47 5668 4 1988 26
the<>production<>of<>19 3215940.9602 14 8202 24 5668 14 1988 17
the<>kind<>of<>20 3215948.4560 5 8202 35 5668 6 1988 21
the<>mass<>of<>21 3215949.1331 9 8202 30 5668 16 1988 16
the<>amount<>of<>22 3215949.5086 5 8202 19 5668 6 1988 16
the<>matter<>of<>23 3215949.6262 4 8202 56 5668 11 1988 23
the<>view<>of<>24 3215950.7291 1 8202 38 5668 2 1988 20
the<>power<>of<>25 3215950.9725 12 8202 87 5668 22 1988 28
the<>habit<>of<>26 3215952.1156 4 8202 36 5668 7 1988 20
the<>necessity<>of<>27 3215952.3455 11 8202 22 5668 12 1988 16
the<>other<>of<>28 3215953.9700 1 8202 207 5668 45 1988 1
the<>fact<>of<>29 3215955.5270 1 8202 66 5668 28 1988 3
the<>stage<>of<>30 3215957.3509 1 8202 26 5668 18 1988 3
the<>point<>of<>31 3215957.6134 3 8202 30 5668 7 1988 17
the<>imagination<>of<>32 3215959.7609 2 8202 27 5668 19 1988 2
the<>possession<>of<>33 3215960.0980 7 8202 19 5668 7 1988 15
the<>writer<>of<>34 3215960.9647 1 8202 38 5668 21 1988 3
the<>rest<>of<>35 3215963.5930 14 8202 35 5668 16 1988 14
the<>spirit<>of<>36 3215966.7353 11 8202 83 5668 17 1988 26
the<>result<>of<>37 3215969.0320 10 8202 38 5668 17 1988 15
the<>knowledge<>of<>38 3215970.5332 1 8202 66 5668 2 1988 21
the<>state<>of<>39 3215974.9168 2 8202 35 5668 5 1988 16
the<>standard<>of<>40 3215975.4003 7 8202 26 5668 8 1988 15
the<>chief<>of<>41 3215976.2044 1 8202 22 5668 15 1988 1
the<>effect<>of<>42 3215977.2412 9 8202 52 5668 19 1988 15
the<>author<>of<>43 3215977.2424 2 8202 25 5668 16 1988 2
the<>millions<>of<>44 3215977.8912 2 8202 17 5668 3 1988 12
the<>influence<>of<>45 3215978.3595 11 8202 36 5668 15 1988 14
the<>line<>of<>46 3215979.1140 8 8202 33 5668 13 1988 14
the<>conception<>of<>47 3215980.2239 1 8202 19 5668 1 1988 12
the<>study<>of<>48 3215980.4541 9 8202 33 5668 9 1988 15
the<>form<>of<>49 3215980.7319 10 8202 46 5668 10 1988 17
the<>recognition<>of<>50 3215981.0518 3 8202 12 5668 4 1988 10
.
.
.
thing<>,<>the<>85636 3220712.5856 1 79 8193 8202 6 8 475
possible<>,<>the<>85637 3220712.6360 1 42 8193 8202 4 4 475
life<>which<>the<>85638 3220712.6470 1 360 551 8202 3 23 34
,<>the<>labor<>85639 3220712.7652 1 8193 8202 36 475 3 3
,<>the<>better<>85640 3220712.7882 1 8193 8202 66 475 3 5
present<>,<>the<>85641 3220712.9113 1 45 8193 8202 3 3 475
English<>,<>the<>85642 3220713.4347 1 59 8193 8202 4 5 475
and<>men<>of<>85643 3220713.6153 1 3960 238 5668 9 173 13
see<>,<>the<>85644 3220713.9829 1 88 8193 8202 6 7 475
you<>,<>the<>85645 3220714.5447 1 172 8193 8202 9 10 475 (end)
*****************************************************************
One interesting thing here is that those trigrams are almost all of the form
"the ... of". We know "the ... of" is a very frequently occurring pattern (its
frequency is 1988), and it does show dependence between those two words.
Comparing this result with user3, we can see ll3 is much better at measuring
dependence, especially two-word dependence. However, the phi coefficient method
is better at detecting independence.
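The log likelihood computation behind ll3 can be sketched as follows: a minimal
Python illustration of G^2 over the 2x2x2 table under the null hypothesis
P(xyz) = P(x)P(y)P(z). It is not the ll3.pm module itself, and the function
name is mine:

```python
import math

def trigram_g2(n_xyz, n_x, n_y, n_z, n_xy, n_xz, n_yz, n):
    # Recover all eight observed cells of the 2x2x2 table from the trigram
    # count, the single-word counts, and the two-word counts.
    obs = {
        (1, 1, 1): n_xyz,
        (1, 1, 0): n_xy - n_xyz,
        (1, 0, 1): n_xz - n_xyz,
        (0, 1, 1): n_yz - n_xyz,
        (1, 0, 0): n_x - n_xy - (n_xz - n_xyz),
        (0, 1, 0): n_y - n_xy - (n_yz - n_xyz),
        (0, 0, 1): n_z - n_xz - (n_yz - n_xyz),
    }
    obs[(0, 0, 0)] = n - sum(obs.values())
    totals = {"x": (n - n_x, n_x), "y": (n - n_y, n_y), "z": (n - n_z, n_z)}
    g2 = 0.0
    for (i, j, k), o in obs.items():
        # Expected count under independence: N * P(x level) * P(y level) * P(z level)
        e = totals["x"][i] * totals["y"][j] * totals["z"][k] / (n * n)
        if o > 0 and e > 0:
            g2 += o * math.log2(o / e)
    return 2 * g2

# A perfectly independent 2x2x2 table scores 0.
print(trigram_g2(1, 4, 4, 4, 2, 2, 2, 8))   # 0.0
```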
2.2 CUTOFF POINT
2.2.1 user3.pm
Part of results
*************************************************************
.
.
.
CULTURE<>TO<>ME<>6 369.0733 1 1 10 1 1 1 1
IDLENESS<>IS<>THERE<>6 369.0733 2 3 6 3 2 2 3
EVEN<>IF<>YOU<>6 369.0733 2 2 3 7 2 2 3
Country<>Parson<>feed<>6 369.0733 1 4 1 3 1 1 1
cd<>pub<>docs<>6 369.0733 1 2 2 2 1 1 2
SHAKESPEARE<>WROTE<>AS<>6 369.0733 1 1 1 9 1 1 1 (possible cutoff point)
in<>slumber<>?<>7 369.0732 1 2828 1 397 1 6 1
one<>side<>or<>7 369.0732 1 437 23 739 4 8 1
night<>,<>with<>7 369.0732 1 44 8193 682 6 1 107
the<>stranger<>,<>7 369.0732 1 8202 4 8193 2 774 1
any<>competent<>pharmacist<>7 369.0732 1 325 9 1 1 1 1
In<>a<>piece<>7 369.0732 1 163 2975 11 13 1 6
should<>miss<>for<>7 369.0732 1 198 2 939 1 1 1
to<>uncover<>it<>7 369.0732 1 3528 1 1667 1 79 1
wide<>awake<>citizen<>7 369.0732 1 19 3 18 1 1 1
The<>mention<>of<>7 369.0732 1 660 2 5668 1 119 1
certain<>duties<>in<>7 369.0732 1 89 19 2828 3 5 1
.
.
.
************************************************************
Before that point, none of the three words of a trigram appears often, but
after it, at least one word or one two-word association has high frequency.
This again means the phi coefficient method is good at distinguishing between
individual high- and low-frequency words.
2.2.2 ll3.pm
Part of results
**************************************
.
.
.
f<>wheels<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>Newport<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>Venus<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>egalite<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>cooking<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>mines<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>expansion<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>Maine<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1
of<>repartee<>and<>28664 3220648.0608 1 5668 2 3960 1 267 1 (possible cutoff point)
I<>wondered<>if<>28665 3220648.0652 3 389 8 354 4 5 3
long<>as<>we<>28666 3220648.0656 1 74 976 652 4 1 28
we<>call<>progress<>28667 3220648.0662 1 652 35 24 9 1 1
form<>of<>conversations<>28668 3220648.0681 1 46 5668 3 17 1 2
has<>given<>us<>28669 3220648.0702 2 518 33 168 6 3 4
women<>,<>that<>28670 3220648.0731 1 196 8193 1912 20 2 204
to<>him<>to<>28671 3220648.0780 1 3528 163 3528 22 67 16
the<>Tuileries<>was<>28672 3220648.0812 2 8202 12 586 11 44 2
when<>she<>wore<>28673 3220648.0824 1 231 169 6 9 1 2
learning<>they<>would<>28674 3220648.0859 1 14 515 405 1 1 18
a<>month<>,<>28675 3220648.0966 1 2975 23 8193 9 252 2
It<>may<>actually<>28676 3220648.0996 1 508 301 12 16 1 1
the<>country<>except<>28677 3220648.1016 1 8202 96 39 30 6 1
exciting<>accumulation<>of<>28678 3220648.1103 1 4 12 5668 1 3 9
to<>making<>a<>28679 3220648.1114 1 3528 36 2975 2 153 3
?<>He<>has<>28680 3220648.1191 1 396 112 518 3 2 12
down<>to<>inspect<>28681 3220648.1270 1 79 3528 2 17 1 2
of<>every<>human<>28682 3220648.1298 2 5668 100 116 14 15 6
man<>who<>ever<>28683 3220648.1354 1 203 379 72 13 1 2
eighteenth<>century<>to<>28684 3220648.1357 1 5 25 3528 4 1 2
was<>of<>more<>28685 3220648.1433 1 586 5668 409 3 5 2
about<>some<>of<>28686 3220648.1444 1 211 208 5668 1 2 36
original<>conception<>of<>28687 3220648.1456 1 17 19 5668 1 4 12
the<>next<>generation<>28688 3220648.1457 2 8202 25 32 12 5 3
.
.
.
********************************************************************
Before that point, there is always a two-word association with high frequency,
but after it, no two-word association is very frequent. This also means the
log likelihood method is good at detecting two-word associations.
2.3 OVERALL RECOMMENDATION
The phi coefficient method is not very good for measuring trigrams. It is quite
different from Pearson's chi-square method, which should be much better at
this. The log likelihood method is better at detecting two-word associations
within trigrams.
I searched the web and the books in the math department, and also consulted
Dr. Kang James about a good method for both 2-way and 3-way measures of
association. She suggested the log-linear model and said there were not many
measures other than Pearson's chi-square, log likelihood, and Fisher's method.
The log-linear model is complicated and I have not found a good way to
implement it, so I decided to still use the phi coefficient method. Only after
conducting this experiment did I realize that it is not a very good measure for
trigrams, although it is good for bigrams.
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
maha0084/ll3.pm
package ll3;
require Exporter;
@ISA = qw ( Exporter );
@EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString);
# Aniruddha Mahajan - maha0084
# CS8761 Fall 2002
# NLP Assignment 2
#
# Log Likelihood Ratio for Trigrams
sub initializeStatistic
{
($ngrams, $totaltrigrams, $freqarrindex, $freqarr) = @_;
}
sub calculateStatistic
{
@arr = @_; #get array
$tott = $totaltrigrams;
# xyz x y z xy xz yz order of arguments as received from statistic.pl
$xyz = $arr[0];
$x = $arr[1];
$y = $arr[2];
$z = $arr[3];
$xy = $arr[4];
$xz = $arr[5];
$yz = $arr[6];
$xbaryz = $yz - $xyz; # calculate values of variables on cube of 2x2x2 + totals
$xbarz = $z - $xz; # 27 values in all .. including totals of all 3 matrices
$ybarz = $z - $yz;
$xbarybarz = $xbarz - $xbaryz;
$xybarz = $xz - $xyz;
$xybar = $x - $xy;
$xbary = $y - $xy;
$xbar = $tott - $x;
$xbarybar = $xbar - $xbary;
$ybar = $tott - $y;
$xyzbar = $xy - $xyz;
$yzbar = $y - $yz;
$xzbar = $x - $xz;
$zbar = $tott - $z;
$xybarzbar = $xybar - $xybarz;
$xbaryzbar = $yzbar - $xyzbar;
$xbarzbar = $zbar - $xzbar;
$xbarybarzbar = $xbarzbar - $xbaryzbar;
$ybarzbar = $zbar - $yzbar;
#
#CalculatingExpectedValues
$exyz = ($x*$y*$z)/($tott*$tott); # expected values --> formula derived as shown......
$exbaryz = ($xbar*$y*$z)/($tott*$tott); # from Null Hypothesis ... P(xyz) = P(x)*P(y)*P(z)
$exybarz = ($x*$ybar*$z)/($tott*$tott); # thus (exyz/tott) = (x/tott) * (y/tott) * (z/tott)
$exbarybarz = ($xbar*$ybar*$z)/($tott*$tott); # exyz = x*y*z/tott*tott
$exyzbar = ($x*$y*$zbar)/($tott*$tott); #
$exbaryzbar = ($xbar*$y*$zbar)/($tott*$tott); #
$exybarzbar = ($x*$ybar*$zbar)/($tott*$tott);
$exbarybarzbar = ($xbar*$ybar*$zbar)/($tott*$tott);
#
#CalculateLogLikelihoodRatios
$llrr = 0;
if($exyz!=0 && $xyz!=0)
{
$llr1 = $xyz/$exyz; # calculate log likelihood ratio of each of the 8 possible combinations of x,y,z
$llr1 = log($llr1)/log(2);
$llr1 = $xyz * $llr1;
$llrr += $llr1; # llrr is summation of all such calculated values i.e. log likelihood ratio
}
if($exbaryz!=0 && $xbaryz!=0)
{
$llr2 = $xbaryz/$exbaryz;
$llr2 = log($llr2)/log(2);
$llr2 = $xbaryz * $llr2;
$llrr += $llr2;
}
if($exybarz!=0 && $xybarz!=0)
{
$llr3 = $xybarz/$exybarz;
$llr3 = log($llr3)/log(2);
$llr3 = $xybarz * $llr3;
$llrr += $llr3;
}
if($exbarybarz!=0 && $xbarybarz!=0)
{
$llr4 = $xbarybarz/$exbarybarz;
$llr4 = log($llr4)/log(2);
$llr4 = $xbarybarz * $llr4;
$llrr += $llr4;
}
if($exyzbar!=0 && $xyzbar!=0)
{
$llr5 = $xyzbar/$exyzbar;
$llr5 = log($llr5)/log(2);
$llr5 = $xyzbar * $llr5;
$llrr += $llr5;
}
if($exbaryzbar!=0 && $xbaryzbar!=0)
{
$llr6 = $xbaryzbar/$exbaryzbar;
$llr6 = log($llr6)/log(2);
$llr6 = $xbaryzbar * $llr6;
$llrr += $llr6;
}
if($exybarzbar!=0 && $xybarzbar!=0)
{
$llr7 = $xybarzbar/$exybarzbar;
$llr7 = log($llr7)/log(2);
$llr7 = $xybarzbar * $llr7;
$llrr += $llr7;
}
if($exbarybarzbar!=0 && $xbarybarzbar!=0)
{
$llr8 = $xbarybarzbar/$exbarybarzbar;
$llr8 = log($llr8)/log(2);
$llr8 = $xbarybarzbar * $llr8;
$llrr += $llr8;
}
$llrr = 2 * $llrr;
return($llrr);
}
maha0084/tmi.pm
package tmi;
require Exporter;
@ISA = qw ( Exporter );
@EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString);
# Aniruddha Mahajan
# CS8761 Fall 2002
# NLP Assignment 2
#
# True Mutual Information - Bigrams
sub initializeStatistic
{
($ngrams, $totalbigrams, $freqarrindex, $freqarr) = @_; #initialize variables as passed by statistic.pl
}
sub calculateStatistic
{
@arr = @_; #get array
$totb = $totalbigrams;
#
$xy = $arr[0]; # no. of times xy bigram occurs
$x = $arr[1]; # no. of times x occurs
$y = $arr[2]; # no. of times y occurs
$xybar = $x - $xy; # calculate rest of constituents of matrix
$xbary = $y - $xy; # along with totals
$ybar = $totb - $y;
$xbar = $totb - $x;
$xbarybar = $ybar - $xybar;
# now calculate P(m,n)*log[P(m,n)/(P(m)*P(n))]
# for all constituents of the matrix
$Pxy = $xy/$totb; # P(x,y)
$Px = $x/$totb; # P(x)
$Py = $y/$totb; # P(y)
if($Px!=0 && $Py!=0)
{
$f1 = $Pxy/$Px;
$f1 = $f1/$Py;
if($f1!=0)
{
$f1 = log($f1)/log(2); # Pointwise MI
}
$f1 = $Pxy * $f1; # for True Mutual Info
$ff = $f1;
}
#
$Pxybar = $xybar/$totb;
$Pybar = $ybar/$totb;
if($Px!=0 && $Pybar!=0)
{
$f2 = $Pxybar/$Px;
$f2 = $f2/$Pybar;
if($f2!=0)
{
$f2 = log($f2)/log(2);
}
$f2 = $Pxybar * $f2;
$ff += $f2;
}
#
$Pxbary = $xbary/$totb;
$Pxbar = $xbar/$totb;
if($Pxbar!=0 && $Py!=0)
{
$f3 = $Pxbary/$Pxbar;
$f3 = $f3/$Py;
if($f3!=0)
{
$f3 = log($f3)/log(2);
}
$f3 = $Pxbary * $f3;
$ff += $f3;
}
#
$Pxbarybar = $xbarybar/$totb;
if($Pxbar!=0 && $Pybar!=0)
{
$f4 = $Pxbarybar/$Pxbar;
$f4 = $f4/$Pybar;
if($f4!=0)
{
$f4 = log($f4)/log(2);
}
$f4 = $Pxbarybar * $f4;
$ff += $f4;
}
#
$mi = $ff; # True Mutual Information
return($mi);
}
maha0084/user2.pm
package user2;
require Exporter;
@ISA = qw ( Exporter );
@EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString);
# Aniruddha Mahajan - maha0084
# CS8761 Fall 2002
# NLP Assignment 2
# User2 - Russell & Rao's Coefficient
sub initializeStatistic
{
($ngrams, $totalbigrams, $freqarrindex, $freqarr) = @_; #initialize variables as passed by statistic.pl
}
sub calculateStatistic
{
@arr = @_; #get array
$totb = $totalbigrams;
#
$xy = $arr[0]; # no. of times xy bigram occurs
$x = $arr[1]; # no. of times x occurs
$y = $arr[2]; # no. of times y occurs
$xybar = $x - $xy;
$xbary = $y - $xy;
$ybar = $totb - $y;
$xbar = $totb - $x;
$xbarybar = $ybar - $xybar;
#
# Russell & Rao's Coefficient
#
$R = ($xy)/($xy+$xybar+$xbary+$xbarybar);
# Russell & Rao's Coefficient
return($R);
}
# Russell & Rao's Coefficient = a/(a+b+c+d)
# i.e. R = (xy)/(xy+xbary+xybar+xbarybar)
#
# To illustrate difference between Dice and Russell & Rao's coefficient...
# Conjoint absence is taken into consideration in the denominator whereas it is not so in Dice
#
# I compared the 2 to each other and also to leftFisher
# Here are the results -
# Rank dice user2 0.88065 --> Dice and Russell & Rao's tests are different
# Rank dice leftFisher
# Rank user2 leftFisher
#
# user2 i.e. Russell & Rao's Coefficient produces good results with respect to collocation hunting.
#
# One reference to this coefficient is found at 
# http://www.soziologie.wiso.uni-erlangen.de/koeln/script/chap3.doc
#
# I have also implemented Jaccard's Test.
# Jaccard's coefficient is different from Dice in the fact that in Dice conjoint presence is
# given double weightage .. whereas in Jaccard's coefficient calculation it is not so.
# It is otherwise similar to the Dice coefficient.
#
#
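The contrast described in the comments above can be shown numerically. In this
minimal Python sketch (function names mine), Russell & Rao's denominator
includes conjoint absence (the d cell, so the denominator is the total bigram
count), which makes its score shrink as the corpus grows, while Dice depends
only on the pair and its marginals:

```python
def russell_rao(n_xy, n_total):
    # a / (a + b + c + d): the denominator collapses to the total bigram count.
    return n_xy / n_total

def dice(n_xy, n_x, n_y):
    # 2a / (2a + b + c) = 2*n_xy / (n_x + n_y); ignores conjoint absence.
    return 2 * n_xy / (n_x + n_y)

# Same pair counts, corpora of different sizes:
print(dice(7, 8, 7))              # 0.933..., unaffected by corpus size
print(russell_rao(7, 10_000))     # 0.0007
print(russell_rao(7, 1_000_000))  # 7e-06: shrinks as the corpus grows
```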
maha0084/user3.pm
package user3;
require Exporter;
@ISA = qw ( Exporter );
@EXPORT = qw (initializeStatistic getStatisticName calculateStatistic errorCode errorString);
sub initializeStatistic
{
($ngrams, $totaltrigrams, $freqarrindex, $freqarr) = @_;
}
sub calculateStatistic
{
@arr = @_; #get array
$tott = $totaltrigrams;
# xyz x y z xy xz yz
$xyz = $arr[0];
$x = $arr[1];
$y = $arr[2];
$z = $arr[3];
$xy = $arr[4];
$xz = $arr[5];
$yz = $arr[6];
$xbaryz = $yz - $xyz;
$xbarz = $z - $xz;
$ybarz = $z - $yz;
$xbarybarz = $xbarz - $xbaryz;
$xybarz = $xz - $xyz;
$xybar = $x - $xy;
$xbary = $y - $xy;
$xbar = $tott - $x;
$xbarybar = $xbar - $xbary;
$ybar = $tott - $y;
$xyzbar = $xy - $xyz;
$yzbar = $y - $yz;
$xzbar = $x - $xz;
$zbar = $tott - $z;
$xybarzbar = $xybar - $xybarz;
$xbaryzbar = $yzbar - $xyzbar;
$xbarzbar = $zbar - $xzbar;
$xbarybarzbar = $xbarzbar - $xbaryzbar;
$ybarzbar = $zbar - $yzbar;
#
$R = ($xyz)/($xyz+$xyzbar+$xybarz+$xybarzbar+$xbaryz+$xbaryzbar+$xbarybarz+$xbarybarzbar); # Russell & Rao's Coefficient over all 8 cells
return($R);
}
# Russell & Rao's Coefficient is extended to trigrams easily due to the linear relation in the
# calculation of the constituents which contribute towards the coefficient.
#
#
#
#
#
From maha0084@d.umn.edu Sun Oct 20 20:48:25 2002
Date: Sun, 20 Oct 2002 20:47:55 -0500 (CDT)
From: Aniruddha S Mahajan
To: ted pedersen
Subject: Re: experiments.txt
In-Reply-To: <200210202055.g9KKt9R28080@csdev04.d.umn.edu>
Hi Ted ...
Thanks for letting me know ... I am at a complete loss to explain how
it happened. I have several backup files and i am pasting my
experiments.txt file here. I owe you one...
Thanks,
Andy
# Aniruddha Mahajan
# CS8761 Fall2002
# Assignment 2
# experiments.txt
#
Introduction:
We have to identify 2 or 3 word collocations from a large body of text.
A collocation is a sequence of words that can be viewed as a syntactic or
semantic unit. Identification of proper collocations is important as we need
collocations which are semantically right. For example ",<>and" is a bigram
and ",<>and<>the" is a trigram, but neither of these can be counted as a
collocation, as they have no 'proper' meaning in the English language. The
meaning of a collocation is not completely defined by simply conjoining the
meanings of its constituents. Thus we want to identify collocations from the
list of bigrams/trigrams. It can be said that the test or coefficient used
should assign higher scores to such bigrams/trigrams.
A good test should throw up collocations, i.e. words which occur together
(much?) more frequently than separately.
A corpus of 1.1 million words was used for all tests. This was taken from
the Brown Corpus. The NSP package developed by Ted Pedersen and Satanjeev
Banerjee was used to implement all tests on the said corpus.
****************************
Experiment 1
****************************
TRUE MUTUAL INFORMATION
^^^^^^^^^^^^^^^^^^^^^^^
To implement "True Mutual Information" for 2 word sequences. True Mutual
Information is calculated using the formula
TMI = Summation[ P(X,Y) * log( P(X,Y) / (P(X)*P(Y)) ) ]
where log is to the base 2 and X and Y range over the values {x,xbar} and
{y,ybar} respectively.
Upon implementing tmi and executing NSP we get a list of bigrams sorted by
rank. These bigrams reveal some collocations amongst a large number of
bigrams which cannot be classified as collocations. Some collocations like
"New York", "U S" (because of the periods in U.S.), "United States",
"years ago", "World War", "Los Angeles", "fiscal year", "General Motors",
"Civil War", "Puerto Rico", "anti Semitism" etcetera appear in the top 50
ranks of this file. Many bigrams like ". The" "of the" "in the" ", but"
"it is" ". He" do not offer much useful information to us in this regard.
However we do learn that tmi gives us top occurring bigrams and not
collocations.
Method of execution and output ...
perl statistic.pl tmi.pm tmiop2.txt ctop2.txt
gives the output (corpus of 1.1 million words):
.<>The<>1 0.0142 4366 40218 6134
of<>the<>2 0.0109 8429 32697 54968
,<>and<>5 0.0047 4593 49461 23745
United<>States<>8 0.0033 341 450 441
New<>York<>15 0.0025 256 522 28
number<>of<>27 0.0012 331 452 32697
years<>ago<>31 0.0008 103 845 204
of<>his<>33 0.0006 616 32697 4901
Peace<>Corps<>34 0.0005 46 60 77
USER2 (Russell & Rao's Coefficient)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To calculate Russell & Rao's coefficient for bigrams.
R = a / (a + b + c + d)
where
R -> Russell & Rao's coefficient
a -> xy
b -> xybar
c -> xbary
d -> xbarybar
Although the formula is similar to that of the Dice coefficient, the two
are very different.
Executing rank.pl on dice and rao gives a correlation coefficient of
-0.8805, which means that the two coefficients are quite opposite. Conjoint
absence is taken into consideration in the denominator, whereas it is not so
in Dice. The Russell-Rao coefficient does not produce many collocations in
the very top ranks, but they do appear in the top 50: "number of",
"United States", "Chinese Puzzle", "ultramarine blue", "straight hair",
"Constable Henry" are some. However, the Russell-Rao coefficient produced a
fair number of collocations per set of trigrams in the lower ranks. The
Russell-Rao coefficient is always less than 1 (except when that bigram is
all the corpus consists of).
perl statistic.pl user2.pm user2op2.txt ctop2.txt
gives output like the following (corpus of 1.1 million words):
of<>the<>1 0.0074 8429 32697 54968
in<>the<>2 0.0042 4729 17359 54968
he<>had<>18 0.0004 439 4945 3526
number<>of<>19 0.0003 331 452 32697
United<>States<>19 0.0003 341 450 441
psychiatric<>knowledge<>22 0.0000 1 5 132
nuclear<>age<>22 0.0000 1 110 208
poem<>by<>22 0.0000 2 43 4754
EXAMPLE
of<>the<>1 0.0074 8429 32697 54968

              the       thebar
of            8429       24268      32697
ofbar        46539     1059081    1105620
             54968     1083349    1138317

of the -> R = 8429/(8429+24268+46539+1059081)
R = 0.0074
Thus the calculated and observed values of Russell & Rao's coefficient are
the same.
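The same worked example can be reproduced from the NSP counts alone, since
the four cells follow from the joint count, the two word counts and the
total (a Python sketch, not the user2.pm module itself):

```python
def russell_rao(n11, n1p, np1, total):
    """Russell & Rao's coefficient R = a/(a+b+c+d) for a bigram, with the
    four cells derived from NSP's counts as in the table above."""
    a = n11                          # xy
    b = n1p - n11                    # x ybar
    c = np1 - n11                    # xbar y
    d = total - n1p - np1 + n11      # xbar ybar
    return a / (a + b + c + d)       # the denominator is simply the total

# of<>the: counts 8429 32697 54968, total 1138317 bigrams
print(round(russell_rao(8429, 32697, 54968, 1138317), 4))   # 0.0074
```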
Top 50 Comparison
^^^^^^^^^^^^^^^^^
Both the tests give similar results, i.e. in the output of both there are
some bigrams which can be called collocations whereas there are many bigrams
which cannot. One more similarity is that the Russell-Rao coefficient and
tmi both have low values (below 1); thus in the whole CORPUS we get only 39
ranks in tmi and 22 in user2. In both the tests the lower ranks contain more
collocations.
Cutoff Point
^^^^^^^^^^^^
There is no such cutoff point observed in these tests. Both the tests
are unable to pick out the collocations from the bigrams. Though we can say
that a cutoff point does exist near the end, where lots of collocations are
found, for both tmi and user2.
There does exist a cutoff point for ll.pm: after about 500 ranks
collocations begin to appear in the output, and they become prominent in the
output after 8000 ranks.
Comparing ll, mi, tmi, user2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank ll tmi 0.9613
rank ll user2 0.9880
rank mi tmi 0.9613
rank mi user2 0.9878
rank dice user2 -0.8805
rank tmi user2 0.9999
Overall Recommendation
^^^^^^^^^^^^^^^^^^^^^^
I think (after observing results from tests performed on the CORPUS) mi is
the best one for searching for collocations (in a corpus of similar size).
ll is also good, but the very top ranks are dominated by bigrams which
cannot be classified as collocations,
e.g. ".<>The<>" "of<>the<>" ",<>but<>".
*******************************
Experiment 2
*******************************
Log Likelihood for Trigrams
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Log Likelihood was to be implemented for trigrams.
p(W1)p(W2)p(W3) = p(W1,W2,W3) is the given Null Hypothesis to
implement ll3. Using the same,
P(xyz) = P(x)P(y)P(z)
exyz/tott = x/tott * y/tott * z/tott
exyz = (x*y*z)/(tott*tott)
where exyz is the expected value of xyz and tott is the total number of
trigrams. Thus we get the formula for the expected value in the Log
Likelihood ratio. Now, Log Likelihood can be calculated by the following
formula:
L = 2 * Summation[ XYZ * log(XYZ/eXYZ) ]
where L = Log Likelihood Ratio
XYZ = Observed value of XYZ
eXYZ = Estimated (expected) value of XYZ
-> X,Y,Z take values (x,xbar), (y,ybar), (z,zbar) respectively.
-> log is taken to base 2
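A Python sketch of this calculation (not the ll3.pm module itself; the
trigram total 1138316 is an assumption, one less than the bigram total from
Experiment 1, but with it the sketch reproduces a score from the output
below):

```python
from math import log2

def ll3(n111, nx, ny, nz, nxy, nxz, nyz, total):
    """Log Likelihood Ratio for a trigram under H0: P(xyz) = P(x)P(y)P(z).
    Argument order follows count.pl: joint count, the three word counts,
    the three pair counts (xy, xz, yz), and the total number of trigrams."""
    # observed counts for the 8 cells of the 2x2x2 table (inclusion-exclusion)
    obs = {(1, 1, 1): n111,
           (1, 1, 0): nxy - n111,
           (1, 0, 1): nxz - n111,
           (0, 1, 1): nyz - n111,
           (1, 0, 0): nx - nxy - nxz + n111,
           (0, 1, 0): ny - nxy - nyz + n111,
           (0, 0, 1): nz - nxz - nyz + n111}
    obs[(0, 0, 0)] = total - sum(obs.values())
    px, py, pz = nx / total, ny / total, nz / total
    score = 0.0
    for (i, j, k), o in obs.items():
        # expected count under the null hypothesis
        e = total * (px if i else 1 - px) \
                  * (py if j else 1 - py) \
                  * (pz if k else 1 - pz)
        if o > 0:
            score += o * log2(o / e)
    return 2 * score

# Metropolitan<>Rhode<>Island from the output below: 1 12 88 124 1 1 76
print(round(ll3(1, 12, 88, 124, 1, 1, 76, 1138316), 1))   # ~2005.5
```

Under the assumed total this matches the reported score 2005.5033 for that
trigram.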
command used to run NSP on ll3...
perl statistic.pl --ngram 3 ll3.pm ll3op3.txt ctop33.txt
Results obtained were as follows.
Almost every trigram had a different rank. ll3 showed sound results, with
some collocations springing up in the top 50, and more in the later ranks
(>7000).
,<>.<>The<>1 36937.9893 7 49461 40218 6134 56 30 4366
of<>.<>The<>2 35662.0687 1 32697 40218 6134 18 1 4366
.<>The<>.<>3 34622.8955 1 40218 6134 40219 4366 293 1
Metropolitan<>Rhode<>Island<>52236 2005.5033 1 12 88 124 1 1 76
the<>first<>half<>52539 1958.5829 8 54968 1129 291 452 29 11
USER3 - Russell & Rao's coefficient
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Russell & Rao's coefficient was extended to trigrams. The formula is
linear, so it can easily be extended to incorporate more variables.
R = xyz / (xyz + xyzbar + xybarz + xbaryz + xybarzbar + xbaryzbar +
xbarybarz + xbarybarzbar)
One thing to note is that Russell & Rao's coefficient is always less than
one, and in this case it is pretty small; thus we get very few ranks, as NSP
scores extend only up to 4 decimal places. One solution may be to scale the
output by a factor of 1000 or something. The output produces results such as
"front page stories" "day by day" "and later patented". Collocations which
have a definite meaning are harder to get for trigrams, in these cases at
least.
perl statistic.pl --ngram 3 user3.pm user3op3.txt ctop33.txt
Results obtained were ..
,<>and<>the<>1 0.0004 443 49461 23745 54968 4593 2388 1785
.<>It<>is<>1 0.0004 405 40218 1759 9559 1378 1029 526
the<>number<>of<>4 0.0001 84 54968 452 32697 106 7513 331
parked<>in<>front<>5 0.0000 3 28 17359 164 4 3 68
Day<>by<>day<>5 0.0000 1 52 4754 575 1 3 7
brief<>glimpses<>of<>5 0.0000 1 60 4 32697 1 6 3
Example
,<>and<>the<>1 0.0004 443 49461 23745 54968 4593 2388 1785
R = (443)/(443+1342+51238+1945+4150+42923+17810+1018466)
      xyz  xbaryz  xbarybarz  xybarz  xyzbar  xybarzbar  xbaryzbar  xbarybarzbar
R = 443 / 1138317
R = 0.00038 ~ 0.0004
Thus the calculated value equals the observed value.
Top 50 Comparison
^^^^^^^^^^^^^^^^^
Both the tests (ll3 and user3) start off by giving the highest ranks to
trigrams which occur most frequently. Trigrams with punctuation marks,
articles like 'the' and 'a', and also the word 'and' occur most frequently
in the top ranks. After that, however, around the 5th rank (user3 has
limited ranks, but still it is quite early) user3, i.e. Russell & Rao's
coefficient, shows some collocations and trigrams which are not exactly
collocations but are 'in betweens'. ll3 continues to be partial towards
articles, "and" etc.
ll3 shows decent results pretty late, but then it has got a different rank
for almost every trigram. In the top 50 (with scaling for user3) user3 gives
more collocations and reasonable word sequences than ll3.
Cutoff Point
^^^^^^^^^^^^
Again, without scaling user3 has a cutoff point of 5, which is around
8000 trigrams out of a million. After this point it begins to show
meaningful word sequences with collocations scattered around. I did not find
any order in the occurrence of collocations as such, but then again 3 word
sequences which could be called collocations were themselves very few. ll3
also exhibits a cutoff, but gradually, over ranks in the range of 10000.
Rank Comparison
^^^^^^^^^^^^^^^
I am using this comparison to show that user3 retained its qualities
after extension for trigrams, and also to show that the Russell-Rao and dice
coefficients are different in nature.
Rank user2 dice -0.8805 -> user2 & dice are different in nature
Rank ll user2 0.9994
Rank ll3 user3 0.9880 -> user3 retained its correlation coefficient,
thus the implementation is valid
Overall Recommendation
^^^^^^^^^^^^^^^^^^^^^^
It is difficult to pick one of the two tests. Neither of them is very
good at selecting collocations from the corpus. Still, user3 has a definite
cutoff point, while ll3 has a more gradual change of trigram quality. I
would pick Russell & Rao's coefficient because I feel that it gives better
collocations even if they are low ranked.

On Sun, 20 Oct 2002, ted pedersen wrote:
>
> Hi Andy,
>
> Your experiments.txt file seems to be a tar file that consists of your
> .pm files. Do you have a copy of your report somewhere? If you do, go
> ahead and send it to me as plain text via email. Let me know if you
> don't...
>
> Thanks,
> Ted
>
> 
> # Ted Pedersen 7268770 #
> # tpederse@umn.edu http://www.umn.edu/~tpederse #
>
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
Typed and Submitted by: Ashutosh Nagle
Assignment 2: The Great Collocation Hunt
Course:CS8761
Course Name: NLP
Date: Oct 11, 2002.
++
 Introduction 
++
In this assignment, I am interested in finding and implementing
tests of association for bigrams and trigrams. I have implemented 4 such
tests and found some interesting relations among some of these.
########################################################
# #
# http://www.cs.brown.edu/~dpb/papers/dpbcolloc01.pdf #
# #
########################################################
Given above is the URL where I found this test. The test given here
is applicable to all n-grams. I have used it for both bigrams and trigrams.
This test is based on the "Log-linear model of n-way interaction"
(as stated in section 2.3 of the above document). The above file gives 2
different measures of association, namely "unigram subtuples" and "all
subtuples". I have used the "all subtuples" measure in my implementation.
The details of how these measures are derived are available on the web.
Given below are the details of how I calculate these measures based on the
information in the above pdf file.
Let the trigram be A1<>A2<>A3, in general. For a particular trigram
a1<>a2<>a3, A1=a1, A2=a2 and A3=a3. A1, A2 and A3 are the random variables.
The above test considers all possible combinations of equalities and
inequalities (i.e. Ai=ai or Ai != ai). It represents every such combination
by a 3 bit integer, with a 1 for every equality and a 0 for every
inequality. For example, for the above trigram, the test makes 8 such
combinations like
1) (A1!=a1)&&(A2!=a2)&&(A3!=a3),
2) (A1!=a1)&&(A2!=a2)&&(A3=a3),
3) (A1!=a1)&&(A2=a2)&&(A3!=a3),
..........and so on.
The 3 bit integer corresponding to the first combination above will
be 000 = 0, the second 001 = 1, the third 010 = 2 (binary to decimal
conversion), and so on. The next variable that the measure defines is Cb,
which is the number of times the combination represented by a particular b
occurred in the input text. For example, C5 would mean the number of times
the combination (A1=a1)&&(A2!=a2)&&(A3=a3) occurred in the input text,
because 5 represents the bit pattern 101.
The point that I noted about this measure is that it makes even the
inequalities explicit. By that I mean that when C5 is calculated, attention
is paid not only to the occurrence of A1=a1 and A3=a3, but also to the
absence of A2=a2, i.e. A2!=a2. In NSP, the values that we get include all
the possibilities, i.e. we get the number of trigrams in which (A1=a1 and
A3=a3), but in these trigrams A2 may or may not be 'a2'. Because of this the
value of C5 is computed as
C5 = N(A1=a1 && A3=a3) - N(A1=a1 && A2=a2 && A3=a3)
where N represents the number of times the tuple inside the parentheses
occurred. Both the values are available from count.pl.
Then the model computes lambda as follows.
Upper limit UL = (2 raised to the power n) - 1, where n is 3 for trigrams
NoO = Number of 1's in a particular 'b'
Lambda = Summation over b=0 to UL { [(-1) raised to the power (n - NoO)] * [log(Cb)] }
Please take a look at the formula on the web site.
The last variable that the model defines before calculating the measure is
sigma, which is equal to
sigma = Square root of { Summation over b=0 to UL (1/Cb) }
Finally, the measure Mu is calculated as
Mu = Lambda - 3.29 * sigma.
______________________________
Sample calculation for user3.

Let's take an example of the following values being passed from count.pl's
output by statistic.pl.
(2,3,2,2,2,2,2) = (c7, c4, c2, c1, c6, c5, c3)
Say the total number of trigrams in the file = 10.
Here, the various values are as follows. The leftmost parentheses use the
convention used in NSP.
(0,1,2) = c111 = c7 - because all the words occur. [2]
(0)     = c100 = c4 - but this contains the values c4 + c5 + c6 + c7,
as explained above [3]
(1)     = c010 = c2 - but this contains the values c2 + c3 + c6 + c7 [2]
(2)     = c001 = c1 - but this contains the values c1 + c3 + c5 + c7 [2]
(0,1)   = c110 = c6 - but this contains the values c6 + c7 [2]
(0,2)   = c101 = c5 - but this contains the values c5 + c7 [2]
(1,2)   = c011 = c3 - but this contains the values c3 + c7 [2]
Let c0 = Total number of trigrams.
Purify the values by removing the additional factors:
c3 = c3 - c7 = 0;
c5 = c5 - c7 = 0;
c6 = c6 - c7 = 0;
c1 = c1 - (c3 + c5 + c7) = 0;
c2 = c2 - (c3 + c6 + c7) = 0;
c4 = c4 - (c5 + c6 + c7) = 1;
c0 = 10 - (c1 + c2 + ... + c7) = 7;
Now let's add the continuity correction factor 0.5 to all the c's.
Therefore, c0 = 7.5, c1 = c2 = c3 = c5 = c6 = 0.5,
c4 = 1.5, c7 = 2.5
Now let's calculate the number of 1's in the binary representation
of the integers between 0 and 7.
b0 = 000 = 0
b1 = 001 = 1
b2 = 010 = 1
b3 = 011 = 2
b4 = 100 = 1
b5 = 101 = 2
b6 = 110 = 2
b7 = 111 = 3
Now let's make bi = n - bi, where n is 3 for trigrams.
b0 = 3
b1 = 2
b2 = 2
b3 = 1
b4 = 2
b5 = 1
b6 = 1
b7 = 0
Lambda = 0
sigma = 0
for i moving from 0 to 7 do
{
multiplier = (-1) to the power bi;
term = log(ci)
term = term * multiplier
lambda = lambda + term
sigma = sigma + (1/ci)
}
Using the above algorithm (log here is the natural logarithm), the values of
term * multiplier for the various ci's are
c0 -> -2.0149
c1 -> -0.6931
c2 -> -0.6931
c3 -> +0.6931
c4 -> +0.4054
c5 -> +0.6931
c6 -> +0.6931
c7 -> +0.9162
Adding the above values gives us lambda = 0
Similarly the 1/ci values are
c0 -> 0.1333
c1 -> 2
c2 -> 2
c3 -> 2
c4 -> 0.6666
c5 -> 2
c6 -> 2
c7 -> 0.4
Adding the above values gives us sigma = 11.2
Finally the measure
Mu = lambda - 3.29 * sigma = 0 - 3.29 * 11.2
= -36.848
The same formula is applicable to bigrams. The only difference
is that we have 4 c's and 4 b's instead of 8.
______________________________
Sample calculation for user2.

The example is for the values (2, 3, 2) = (x1, x2, x3) returned
from count.pl, and total #(bigrams) = 11.
Purifying the values:
c3 = x1 = 2
c2 = x2 - x1 = 3 - 2 = 1
c1 = x3 - x1 = 2 - 2 = 0
c0 = total - (c1 + c2 + c3) = 11 - (2 + 1 + 0) = 8
After adding the continuity correction factor 0.5:
c0 = 8.5
c1 = 0.5
c2 = 1.5
c3 = 2.5
Then (bi is the number of 1's in the binary representation of i):
b0 = 00 => 0
b1 = 01 => 1
b2 = 10 => 1
b3 = 11 => 2
n = 2 for bigrams
n - b0 = 2
n - b1 = 1
n - b2 = 1
n - b3 = 0
The values of term * multiplier for all ci's are
c0 -> +2.1400
c1 -> +0.6931
c2 -> -0.4054
c3 -> +0.9162
The sum of these values gives us lambda = 3.3440
For Mu:
1/c0 = 0.1176
1/c1 = 2
1/c2 = 0.6667
1/c3 = 0.4
The sum of these values gives us sigma = 3.1843
Finally the measure Mu = lambda - 3.29 * sigma
= 3.3440 - 3.29 * 3.1843
= -7.1323
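Both sample calculations can be checked with a short Python sketch (the
helper name mu is mine, not the module's). Natural logarithms reproduce the
lambda values above, and sigma is taken as the plain sum of the 1/ci values,
as in the worked numbers:

```python
from math import log

def mu(cs):
    """Mu = lambda - 3.29 * sigma over the 2^n cells of an n-gram.
    cs[b] is the purified, continuity-corrected count for bit pattern b
    (b's binary digits mark which words match)."""
    n = len(cs).bit_length() - 1      # 4 cells -> bigram, 8 cells -> trigram
    lam = sum(((-1) ** (n - bin(b).count("1"))) * log(c)
              for b, c in enumerate(cs))
    sigma = sum(1 / c for c in cs)    # summed as in the worked examples
    return lam - 3.29 * sigma

# bigram sample: c0..c3 = 8.5, 0.5, 1.5, 2.5
print(round(mu([8.5, 0.5, 1.5, 2.5]), 4))                        # -7.1323
# trigram sample: c0..c7 = 7.5, 0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 2.5
print(round(mu([7.5, 0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 2.5]), 3))    # -36.848
```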
++
 Experiment 1: bigrams 
++
The corpus used is BTL.TXT from BROWNTAG.
TOP 50 COMPARISON:
Looking at TOP 50 ranks in the files generated by
user2 and tmi, I feel that user2 is much better in identifying
the significant or interesting collocations. I have placed both
these files in /home/cs/nagl0033/CS8761 directory.
From the formula of tmi, I feel that if p(x,y) is too high,
it has a major impact on the rank of the bigram. Even when x and y occur
only together, say n times, the ratio of p(x,y) to [p(x)*p(y)] is 1/n. At
all other times it has to be less than this. Depending on n we get some
negative value (large or small) of the log. We multiply this with p(x,y) and
sum over all the bigrams. So I really do not see any correlation between
this and the strength of the association between x and y.
On the other hand, as per the information given in the above
mentioned pdf file, the bigrams that score poorly on the above test are
name-like bigrams, e.g. Nitin Sharma. The words in such bigrams occur only
in the particular names. So if the first word occurs we know what the next
word is going to be. On the other hand, words that occur together as well as
separately quite often score well in the above test. It certainly makes
sense that we are likely to be interested in such combinations, rather than
the ones whose words do not occur outside the combination.
CUT OFF:
I do not see any cut off in the output files. But I realise that the
bigrams given by tmi are initially totally meaningless, then some sensible
ones appear, then meaningless ones again. In the output file of user2, I
feel that the bigrams are mostly meaningful and they become meaningless deep
down the file.
RANK COMPARISONS:
_______________________________________
        |    user2     |     tmi
--------|--------------|---------------
tmi     |   -0.8685    |
mi      |    0.8076    |   -0.8688
dice    |    0.8556    |   -0.6971
x2      |    0.8486    |   -0.8681
ll      |    0.7277    |   -0.8676
________|______________|_______________
Thus from these values, user2 is close to all the tests, but tmi is
not. It gives negative values in all the comparisons. This strengthens my
belief that tmi should not be a measure of association. I formed this belief
after seeing the unrelated and meaningless bigrams in the output file of tmi.
OVERALL RECOMMENDATIONS:
From the experiment I conclude that tmi is not a good measure of
association. Throughout the experiment, tmi has always proved this, e.g.
through its output file, and through the negative results nearing -1 in the
output of rank.pl.
Regarding the remaining ones, I do not feel that any one measure is much
better than the others, because they have all given values in the same range.
Also, their output files are not observed to have any outstanding feature.
++
 Experiment 2: trigrams 
++
TOP 50 COMPARISON:
Looking at TOP 50 ranks in the files generated by
user3 and ll3, I feel that user3 is much better in identifying
the significant or interesting collocations. I have placed both
these files in /home/cs/nagl0033/CS8761 directory.
CUT OFF:
In the file generated by ll3, I get all useless trigrams near the beginning
of the file; e.g. a sample trigram with all its values is
.<>SD11<>:<>81 68480.7650 11 17965 11 9813 11 9002 11
Most of the trigrams up to rank 84 (score 68393) are like this. But from
here onwards I start seeing some meaningful words.
With user3 I get meaningful values till rank 1265.
RANK COMPARISONS:
_______________________
        |    user3
--------|--------------
ll3     |    0.1315
________|______________
From this value I say that the two measures user3 and ll3 are not quite
similar. Though they are not exactly opposite, they are not similar either.
OVERALL RECOMMENDATIONS:
From the experiment I conclude that user3 is a good measure of association,
as indicated by its output file. ll3 should also be good, but may not be as
good as user3. This conclusion is purely based on the output files. I
understand that there could very well be some counter example(s).
Computation.
In ll3 a value ni,j,k has the same meaning, except that it is for trigrams.
From count.pl we get the following 8 values:
All the trigrams = n+++
f(0,1,2), f(0), f(1), f(2), f(0,1), f(0,2), f(1,2) respectively mean
n000, n0++, n+0+, n++0, n00+, n0+0, n+00
From these I calculate the remaining 19 (27 - 8) values as
n1++ = n+++ - n0++
n+1+ = n+++ - n+0+
n++1 = n+++ - n++0
These were the values with two +'s.
Similarly, for just 1 +:
n+01 = n+0+ - n+00
n+10 = n++0 - n+00
n+11 = n++1 - n+01
In the same way n0+1, n1+0, n1+1, n01+, n10+, n11+ are computed.
Now,
n011 = n01+ - n010;
Take any position containing a 1 in the bit string (here, say, position 3 in
011). Change it to a '+' to get the first term on the right hand side, and
change it to a '0' to get the second term on the right hand side.
Care is taken that the values are computed in such an order that all the
required values are computed before them. So any position with a 1 can be
selected.
Similarly, n001, n010, n100, n101, n110, n111 are computed.
Once this is done,
eijk = (ni++)*(n+j+)*(n++k)/[(n+++)*(n+++)]
This formula is derived in 2 steps from the given null hypothesis
P(x,y,z) = P(x)*P(y)*P(z);
Once all the eijk and nijk are computed they are substituted in the formula
G square = 2 * Summation over all i,j,k ( nijk * log(nijk/eijk) );
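The reconstruction above can be sketched in Python (the helper name
trigram_cells is mine, and the counts are made up for illustration):

```python
from math import log

def trigram_cells(f012, f0, f1, f2, f01, f02, f12, total):
    """Reconstruct the n-values of the 2x2x2 table from the 8 numbers that
    count.pl provides, using the flip-a-1 rule described above, then
    compute G^2.  ('0' in a position: the word matches, '1': it does not,
    '+': either.)"""
    n = {"000": f012, "0++": f0, "+0+": f1, "++0": f2,
         "00+": f01, "0+0": f02, "+00": f12, "+++": total}
    # values with two +'s
    n["1++"] = n["+++"] - n["0++"]
    n["+1+"] = n["+++"] - n["+0+"]
    n["++1"] = n["+++"] - n["++0"]
    # values with one +: flip a 1 to '+' (first term) and to '0' (second)
    n["+01"] = n["+0+"] - n["+00"];  n["+10"] = n["++0"] - n["+00"]
    n["+11"] = n["++1"] - n["+01"]
    n["0+1"] = n["0++"] - n["0+0"];  n["1+0"] = n["++0"] - n["0+0"]
    n["1+1"] = n["++1"] - n["0+1"]
    n["01+"] = n["0++"] - n["00+"];  n["10+"] = n["+0+"] - n["00+"]
    n["11+"] = n["+1+"] - n["01+"]
    # the remaining fully specified cells, in dependency order
    n["001"] = n["00+"] - n["000"];  n["010"] = n["0+0"] - n["000"]
    n["100"] = n["+00"] - n["000"];  n["011"] = n["01+"] - n["010"]
    n["101"] = n["10+"] - n["100"];  n["110"] = n["+10"] - n["010"]
    n["111"] = n["+11"] - n["011"]
    # G^2 = 2 * sum nijk * log(nijk/eijk), with eijk from the marginals
    g2 = 0.0
    for i in "01":
        for j in "01":
            for k in "01":
                e = n[i + "++"] * n["+" + j + "+"] * n["++" + k] / (total * total)
                if n[i + j + k] > 0:
                    g2 += n[i + j + k] * log(n[i + j + k] / e)
    return n, 2 * g2

# made-up counts: joint 10, word counts 30/40/50, pair counts 20/15/25, total 100
cells, g2 = trigram_cells(10, 30, 40, 50, 20, 15, 25, 100)
print(cells["111"])    # 30: the 8 fully specified cells sum back to the total
```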
Conclusion:
I conclude from the above experiments that tmi is not a good measure of
association, but user2, user3 and ll3 are quite acceptable.
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
CS 8761 NATURAL LANGUAGE PROCESSING
FALL 2002
ANUSHRI PARSEKAR
pars0110@d.umn.edu
ASSIGNMENT 2 : THE GREAT COLLOCATION HUNT
Introduction

The objective of this assignment is to find, implement and compare various
measures of association for bigrams and trigrams, which can be used to
identify collocations in a large corpus. Collocations are sequences of words
whose meaning is not the same as the meaning obtained by joining the
meanings of their constituent words. Various methods can be used to find out
whether a sequence of words is a collocation or not. Comparing the
frequencies of sequences of words when they appear together and when they
appear alone can give us a fair idea about the chances of that sequence
being a collocation or not. If the words of a sequence appear alone very
frequently as compared to appearing together, then it may not be a
collocation. Based on this, various measures of association such as
pointwise mutual information, log-likelihood ratio, Dice's coefficient etc.
are used to identify collocations.
Input details:
All four modules that were implemented were run on a corpus obtained by
concatenating files taken from the BROWN corpus. The total number of words
in the corpus was 1039486.
************************
* Experiment no:1 *
************************
True Mutual Information

Mutual information is the amount of information one random variable contains
about another. In the case of bigrams, mutual information can be viewed as
the reduction in uncertainty of the occurrence of a word given knowledge
about the occurrence of another word [1]. Hence if two words occur
independently of each other their mutual information will be zero, and if
the occurrence of two words is highly dependent then their mutual
information will be high. Therefore mutual information can be used as a
measure of association.
Mutual Information = summation over x,y ( p(x,y) * log( p(x,y)/(p(x)p(y)) ) )
User2  Mutual Expectation

This test of association works on textual units (word or character or tag)
of an n-gram. Mutual expectation evaluates the degree of cohesiveness that
links together the textual units and is based on the concept of normalised
expectation [2]. Normalised expectation refers to the expectation of
occurrence of a word given the presence of the rest of the words in the
n-gram. Intuitively, NE is based on conditional probability; however, for an
n-gram the probability that a given word will occur given the rest of the
(n-1) words has to be calculated. Therefore the concept of fair point
expectation is used, which refers to the arithmetic mean of the n joint
probabilities of the (n-1)-grams contained in the n-gram [2]. Hence mutual
expectation is calculated as [3]:
mutual expectation = joint prob of n words * normalised expectation
where,
normalised expectation = joint prob of n words / fair point expectation
Example: United<>States<>5 0.2275 343 453 443
Fair point expectation = (n(x) + n(y))/2
                       = (453 + 443)/2
                       = 448
Normalised expectation = n(x,y)/FPE
                       = 343/448
                       = 0.765625
Mutual expectation = p(x,y) * NE
                   = (343 / 1154563) * NE
                   = 0.2275 (after scaling by 1000)
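A Python sketch of this calculation (not the user2.pm module itself; counts
as in the example above):

```python
def mutual_expectation(nxy, nx, ny, total):
    """Mutual expectation for a bigram, scaled by 1000 as in the example."""
    fpe = (nx + ny) / 2        # fair point expectation (count form)
    ne = nxy / fpe             # normalised expectation
    return (nxy / total) * ne * 1000

# United<>States: 343 453 443, with 1154563 bigrams in the corpus
print(round(mutual_expectation(343, 453, 443, 1154563), 4))   # 0.2275
```

Note that the top 50 listing below reports the same bigram as 262.6094,
which equals 343 * NE, i.e. the same statistic before the joint-probability
scaling.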
Top 50 Comparisons


part of output after running user2 part of output after running tmi

United<>States<>5 262.6094 343 453 443 United<>States<>8 0.0033 343 453 443
New<>York<>8 162.4188 256 522 285 New<>York<>16 0.0024 256 522 285
number<>of<>197 6.6256 333 455 33018 number<>of<>28 0.0012 333 455 33018
even<>though<>198 6.6092 61 812 314 even<>though<>36 0.0004 61 812 314

Comparing the results of tmi.pm and user2.pm, it can be seen that both tests
fail to identify actual collocations. Proper nouns are given higher ranks in
user2 as compared to tmi. However, user2 does not give high ranks to
punctuation marks and prepositions as compared to tmi. The setback for user2
lies in the fact that it gives very high ranks to common phrases such as
"number of", "even though". Therefore tmi seems to be better than user2 for
collocation extraction. However, there wasn't any cutoff point that could
easily divide the output into significant and insignificant groups of words
for either test.
Rank Comparison

The correlation coefficient between true mutual information and pointwise
mutual information was found to be 0.9618.

part of output after running mi part of output after running tmi

Burly<>leathered<>1 20.1389 1 1 1 .<>The<>1 0.0141 4419 41005 6204
Capetown<>coloreds<>1 20.1389 1 1 1 of<>the<>2 0.0108 8506 33018 55619
Bietnar<>Haaek<>1 20.1389 1 1 1 in<>the<>3 0.0062 4764 17511 55619
aber<>keine<>1 20.1389 1 1 1 .<>He<>4 0.0057 1650 41005 2053
drown<>emanations<>1 20.1389 1 1 1 .<>It<>5 0.0047 1400 41005 1791
blandly<>boisterous<>1 20.1389 1 1 1 ,<>and<>5 0.0047 4647 50254 24023
United<>States<>8 0.0033 343 453 443

Pointwise mutual information extracts those pairs of words which may occur a
few times individually but always occur together. On the other hand, true
mutual information (TMI) gives a high score to those pairs of words which
occur together a large number of times and may occur individually a large
number of times too (without the other word). TMI can identify proper nouns
like United States, Rhode Island and terms like per cent, at least, which
occur together. However, TMI gives high ranks to combinations of
punctuation, articles and prepositions which are not given high ranks by
pointwise mutual information. Hence it can be argued that these two tests of
association work in highly different ways.
The ranks obtained by tmi and log-likelihood are nearly the same, except
that the scores given by the log-likelihood test are far larger than those
obtained by tmi. This is because the log-likelihood test can be thought of
as a reformulation of mutual information.

part of output after running mi part of output after running user2

GHOST<>TOWN<>1 20.1389 1 1 1 GHOST<>TOWN<>955 1.0000 1 1 1
BOA<>CONSTRICTOR<>1 20.1389 1 1 1 boa<>constrictor<>152 0.0039 1 1 1
MELTING<>POT<>1 20.1389 1 1 1 MELTING<>POT<>182 0.0009 1 1 1

Comparing the results of user2 with mi shows that mi is more efficient in
finding the collocations, and collocations may get very high rank numbers in
the case of user2. However, mi tends to give a single rank to a lot of
pairs. The correlation coefficient is 0.6777.

part of output after running logliklihood part of output after running user2

of<>the<>2 17310.4232 8506 33018 55619 of<>the<>1 1.4140 8506 33018 55619
United<>States<>9 5281.8724 343 453 443 United<>States<>5 0.2275 343 453 443
New<>York<>20 3909.0139 256 522 285 New<>York<>8 0.1407 256 522 285

The results of log-likelihood are nearly the same as those of user2. User2
gives a somewhat better performance for proper nouns such as New York.
Overall Recommendation:

After observing all 4 results, the pointwise mutual information test
identifies collocations better than any of the rest. Since true mutual
information for two dependent variables grows not only with the degree of
dependence but also with the entropy of the variables [1], it cannot give an
appropriate representation of the dependence of two words as in a
collocation; hence it does not rank collocations appropriately. The user2
(mutual expectation) test is expected to perform better for n-grams.
*********************************
* Experiment No :2 *
*********************************
Log Likelihood Ratio

The log-likelihood ratio is performed on a 3 word sequence assuming the
following null hypothesis (say the words are x,y,z):
p(x,y,z) = p(x) * p(y) * p(z)
The expected value e(x,y,z) under the above hypothesis is given by:
e(x,y,z) = ( n(x) * n(y) * n(z) ) / (total trigrams * total trigrams)
where,
n(x), n(y), n(z) = observed frequencies of x,y,z respectively
The log-likelihood ratio LL is calculated as:
LL = 2 * summation over x,y,z ( n(x,y,z) * log( n(x,y,z) / e(x,y,z) ) )
User3 - Mutual Expectation

The concept of mutual expectation can easily be applied to trigrams.
As mentioned before, this test of association is based on how well the
n-gram accepts the loss of one of its components [2]. The formulae are as
explained in user2.
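A sketch of the trigram version in Python (it assumes the fair point
expectation averages the three pair counts that count.pl reports, and that
the trigram total equals the bigram total 1154563 used above; with those
assumptions it reproduces the 0.1877 score of the<>United<>States in the
output below):

```python
def mutual_expectation3(nxyz, nxy, nxz, nyz, total):
    """Mutual expectation for a trigram, scaled by 1000: the fair point
    expectation is the mean of the three pair counts of the trigram."""
    fpe = (nxy + nxz + nyz) / 3
    ne = nxyz / fpe            # normalised expectation
    return (nxyz / total) * ne * 1000

# the<>United<>States: joint count 265, pair counts 351 278 343
print(round(mutual_expectation3(265, 351, 278, 343, 1154563), 4))  # 0.1877
```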
Top 50 Comparisons

Looking at the top 50 trigrams given by log-likelihood and user3, a number
of significant collocations are obtained from user3. Some are listed below.

part of output after running user3

the<>United<>States<>1 0.1877 265 55619 453 443 351 278 343
as<>well<>as<>3 0.0996 199 5983 772 5983 273 547 213
per<>cent<>of<>20 0.0257 51 379 158 33018 137 70 56
Mr<>and<>Mrs<>21 0.0255 40 735 24023 469 44 40 79
the<>White<>House<>22 0.0252 39 55619 113 185 46 52 59
more<>or<>less<>34 0.0187 25 1920 3796 395 26 25 36
one<>of<>the<>36 0.0179 253 2663 33018 55619 471 311 8506
with<>respect<>to<>36 0.0179 37 6017 123 22443 50 95 54

On the other hand, log-likelihood gives very high ranks to trigrams with
punctuation and does not help significantly in collocation extraction. Some
top trigrams as given by the ll3 module are listed below.

part of output after running ll3

,<>.<>The<>1 37376.3913 7 50254 41005 6204 57 30 4419
of<>.<>The<>2 36039.5281 1 33018 41005 6204 19 1 4419
.<>The<>.<>3 35007.9173 1 41005 6204 41006 4419 302 1
to<>.<>The<>4 34680.5109 2 22443 41005 6204 37 4 4419
a<>.<>The<>5 34526.7254 1 18789 41005 6204 15 1 4419

Cutoff Point

The cutoff point for the user3 output can be observed around rank 120 to
150. However, it is not a very sharp cutoff and some collocations may be
found beyond this limit. No such limit was observed for ll3.
Overall Recommendation

Overall, user3 (mutual expectation) outperforms ll3 for finding collocations
in trigrams. This may be due to the fact that it considers the joint
probabilities of all the (n-1)-grams in an n-gram. However, pointwise mutual
information is better than mutual expectation for bigrams.
References:

[1] Manning C., Schutze H., "Foundations of Natural Language Processing".
[2] Dias G., Guillore S. and Lopes J.G.P., "Extracting Textual Associations in Part-of-Speech
Tagged Corpora".
http://nl.ijs.si/eamt00/proc/Dias.pdf
[3] Dias G., Guillore S. and Lopes J.G.P., "Mining Textual Associations in Text Corpora".
http://www.cs.cmu.edu/~dunja/KDDpapers/Dias_TM.rtf
##############################################################################
ASSIGNMENT 2 THE GREAT COLLOCATION HUNT
REPORT BY  AMRUTA PURANDARE
##############################################################################

OBJECTIVE

To investigate various measures of association which can be
used to identify collocations in large corpora of text.
Identify and implement a measure that can be used with 2 or 3 word
sequences and compare the results with other standard measures.
##############################################################################
=========================================================================
SOME MATHEMATICAL AND STATISTICAL CONCEPTS
USED FOR THE VARIOUS EXPERIMENTS CONDUCTED HERE
=========================================================================

Visualising 3-D data and mapping or transforming it to 2-D space.

While dealing with the 3-variable experiments, the following thoughts
arose while analysing the problem of identifying and visualising
the relationship between 3 random variables. The random variables
here are the words which form a trigram W1 W2 W3, i.e. 3 words
occurring in this particular sequence. Just as we build a
contingency table for tabulating the relation between 2 variables
(words), we try to visualise and analyze the same relation
for the set of 3 variables (words).

VISUALISING 3 VARIABLES WITH VENN DIAGRAMS

To translate the information about the relation between 3 words we
can draw a Venn Diagram with 3 intersecting circles showing 7
possible regions.
These regions are classified as
R1 #Occurrences of all the 3 words together
#Number of times this trigram occurs
= n111 value in the contingency table
R2 #Occurrence of 2 words of the trigram together, at their
right positions, but absence of the third word in the trigram.
#Number of trigrams which contain only 2 of the 3 words of the
trigram (at their right positions/indices).
These are the sections common to two circles but not to all
three.
= n112, n211, n121 values in the contingency table,
where 1 at the ith position indicates the presence of the word at
the ith location in the trigram, and 2 indicates that the word is
not present.
e.g. looking at n112, we say that we have a trigram that looks
like
W1 W2 _ where W1 and W2 occur at the 1st and 2nd positions but
W3 doesn't occur at the third position.
All the values n112, n121, n211 show the regions of the Venn
diagram common to the two circles which represent those words.
R3 #Occurrence of any one word of the trigram.
These are the sections of the circles which belong to only
one circle and thus represent the occurrence of a word without
any of the other words in the trigram.
= n122, n212, n221 show that only the word corresponding
to 1 at the ith position occurs.
e.g. n122 shows the string W1 _ _, i.e. the trigrams
starting with W1 but not followed by W2 or W3.
R4 #Occurrence of none of the words
This is the section of the Venn diagram that belongs to no
circle, indicating all the trigrams which contain none
of the three words.

BUILDING A CONTINGENCY TABLE FROM THE FREQUENCY COMBINATIONS

Using the Venn diagram, we arrive at the derivation of the
following formulae.
The trigram counts we observe are the frequencies of the
word combinations
(0), (1), (2), (0 1), (1 2), (0 2) and (0 1 2)
where these are actually
f(0 1 2) = n111 = all together
f(0) = n1pp = total occurrences of W1
at the 1st position = W1**
f(1) = np1p = same as above = *W2*
f(2) = npp1 = **W3
f(0 1) = n11p = total occurrences of the pair W1W2*
which may be followed by W3
f(1 2) = np11 = same as above for *W2W3
f(0 2) = n1p1 = same for W1*W3
These are the basic values which help us to derive the
entire contingency table for 3 variables.

DETERMINING OTHER CELLS

We represent the contingency table values as
n111 = all words together = n111
n112 = W1W2 not followed by W3 = n11p - n111
n121 = W1(!W2)W3 = n1p1 - n111
n122 = W1(!W2)(!W3)
= n1pp - n11p - n1p1 + n111
n211 = (!W1)W2W3 = np11 - n111
n212 = (!W1)W2(!W3)
= np1p - n11p - np11 + n111
n221 = (!W1)(!W2)W3
= npp1 - n1p1 - np11 + n111
n222 = (!W1)(!W2)(!W3)
= n - n122 - n212 - n221 - n112 - n211 - n121 - n111
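These derivations can be checked with a small sketch (illustrative only, not the actual NSP module code; the cell names follow the nXXX notation above):

```python
def trigram_cells(n111, n1pp, np1p, npp1, n11p, n1p1, np11, n):
    """Derive the remaining cells of the 2x2x2 trigram contingency
    table from the seven observed counts and the sample size n."""
    n112 = n11p - n111                    # W1 W2 present, W3 absent
    n121 = n1p1 - n111                    # W1 W3 present, W2 absent
    n211 = np11 - n111                    # W2 W3 present, W1 absent
    n122 = n1pp - n11p - n1p1 + n111      # W1 only
    n212 = np1p - n11p - np11 + n111      # W2 only
    n221 = npp1 - n1p1 - np11 + n111      # W3 only
    n222 = n - (n111 + n112 + n121 + n122 + n211 + n212 + n221)
    return n112, n121, n122, n211, n212, n221, n222
```

All eight cells sum to n, and every cell must come out non-negative for a consistent set of counts.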

DRAWING A CONTINGENCY TABLE FROM THIS

To draw a contingency table for the above data we represent
W1 and W2 as the rows and columns of the table,
and in each of these cells we place the 2 possible values of W3.
Contingency Table
        W2      W2'
W1      n111    n121
        n112    n122
W1'     n211    n221
        n212    n222
where all the top values in each cell represent W3
and all the bottom values represent W3'
##############################################################################
=======================================
BIPREDICTIVE EXACT TEST
FREEMAN-HALTON TEST
========================================
===========
REFERENCE
===========
[Miller]
Matthias Miller, "Exact Tests for Sample 3x3 Contingency Tables
with Embedded Four-Fold Tables", German Journal of Psychiatry.
We implement here a bipredictive exact test to find the
degree of association between two words.

HOW AN EXACT TEST DIFFERS FROM A CHISQUARE TEST

Chi-square tests are unsuitable when the observed and
expected values are very small. The exact tests differ from the
chi-square test in that they are based on the cumulative
point probabilities of all possible contingency matrices;
the cumulative probability is compared with some
predefined threshold alpha = 0.05, based on which the decision
whether to accept or reject the null hypothesis is taken.

Procedure

The procedure to find the cumulative probability involves the
calculation of the point probabilities of all possible contingency
matrices, adding only those which are the same as or as low as the
observed point probability.
Thus from a given contingency table, we first find its point
probability, n11/n++ for a 2x2 contingency table, which is
the probability of finding the two words together.
To find all possible contingency tables we use the following
algorithm, which can easily be generalised to get the possible
cell values.
Consider a contingency table, for 2 RANDOM VARIABLES
X1 X2 X3

Y1 3 4 5 y1
Y2 4 5 5 y2
Y3 2 2 4 y3

This shows the relationship between two random variables (X, Y),
each of which can take 3 possible values.
Note that the marginal values are fixed: the cell values are
variables, but for a given number of observations N, the number of
instances of each category/type of the random variables
is fixed. Thus, for fixed marginal totals, we find all possible
contingency matrices by varying the cell values and computing
a point probability for each combination.

FINDING ALL POSSIBLE COMBINATIONS

Let's first consider an example for a 2x2 matrix, as it is easier to
begin with, and then extend the algorithm to nXn tables,
or even to nXnXn cubes (for 3 variables).
3 4  Y1=7
2 3  Y2=5

X1=5 X2=7 12
From this, we note that the first cell belongs to two groups,
namely X1 and Y1. Hence the cell value at (1,1) can be varied over
[0, min(X1,Y1)], which is [0,5]. This is intuitive, as the cell value
can't be > 5 if X1 has to be 5 (from n11+n21=X1, X1>=n11,n21).
Hence, we say that the value of the cell (1,1) can be chosen from the
interval [0,5], and for each value of n11 we get a new contingency
table. As the other values can be calculated without any prediction
or assumption, we have, for 2x2 tables, min(X1,Y1)+1 possible tables.
e.g.
T1   T2   T3   T4   T5
0 7  1 6  2 5  4 3  5 2
5 0  4 1  3 2  1 4  0 5
This shows the 5 possible tables apart from the observed table.
From this we calculate, for each of the tables, a point probability,
which is n11/n++, and we take the cumulative addition of all those
which are the same as or smaller than the observed value.
In this case, it would be <= 3/12, which is the observed value.
We see that only tables T1, T2 and T3 meet this criterion, and hence
for these we calculate phi (the coefficient of partial association),
as we cannot simply take the cumulative probability to accept/reject
the hypothesis in this particular program, which is expected to
return the test score.
The test score is calculated as
phi = SUM (n11*n22 - n12*n21) / (n1p * np1 * n2p * np2 * (n11+n12)/n++)
[Miller] for 2 variables
and
phi = SUM (P(x,y,z) - P(x)*P(y)*P(z)) / (n1pp * np1p * n2pp * np2p * npp1 * npp2)**(1/3)
[Miller]
These scores give us an idea of the partial association between the
variables, for tables whose
point probability is smaller than the observed point probability.
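The 2x2 enumeration described above can be sketched as follows (an illustration of the fixed-marginals argument, not the actual module code; function and argument names are hypothetical):

```python
def all_2x2_tables(y1, y2, x1, x2):
    """Enumerate every 2x2 table with fixed row totals (y1, y2) and
    column totals (x1, x2); choosing n11 forces the other three cells."""
    tables = []
    for n11 in range(min(x1, y1) + 1):
        n12 = y1 - n11          # rest of the first row
        n21 = x1 - n11          # rest of the first column
        n22 = y2 - n21          # forced by the second row total
        if min(n12, n21, n22) >= 0:
            tables.append((n11, n12, n21, n22))
    return tables
```

For the example above (Y1=7, Y2=5, X1=5, X2=7), this yields the six tables with n11 = 0..5: the observed table plus the five others.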
The logic of finding all possible contingency tables can be extended
to the general case where we have an rxc contingency table or an
NxMxP 3-D space for 3 variables (this is discussed in user3.pm).
Reconsider our first example, the 3x3 contingency table
3 4 5  Y1=12
4 5 5  Y2=14
2 2 4  Y3=8

X1=9 X2=11 X3=14 N=34
From this data, we note the fixed marginal values associated
with the cells; each cell belongs to 2 groups in the 2-D structure:
n11 => X1,Y1
n12 => X2,Y1
n13 => X3,Y1
n21 => X1,Y2
n22 => X2,Y2
n23 => X3,Y2
n31 => X1,Y3
n32 => X2,Y3
n33 => X3,Y3
We start with cell (1,1), for which the possible range is [0,9]:
we vary the value of n11 from 0 to 9.
For each value of n11, we find the range of n12 such that
the rule of marginal totals holds true,
i.e. n12+n13 = Y1-n11 (with n11 varied in [0,9]).
Also, n12+n22+n32 = X2.
Thus, the range of n12 is calculated as [0, min(Y1-n11, X2)];
in the above example, it is [0, min(11, 12-n11)].
Once we know n11 and n12, we can find n13.
Now for n21, we consider n21+n31 = X1-n11
and n21+n22+n23 = Y2.
Thus n21 can take any value in the interval [0, min(X1-n11, Y2)],
as this satisfies the condition of fixed marginal values.
After finding n21, n31 can easily be calculated from the fact that
we know n11, n21 and X1, and n11+n21+n31 = X1.
To find n22, we consider the equations n22+n23 = Y2-n21 and
n22+n32 = X2-n12, from which the range of n22 is found. Given n22,
we can find n23 and n32, and then also n33, as all the other
variables in the equations have known values.
After we assign this set of values to the current table cells,
we say that we have obtained one of all the possible contingency
tables. We then test whether the point probability of this
combination is < the observed point probability; if yes, then
we find the score.
Note that the process of assigning all possible values to all the
cells in the table could easily be generalised, though we haven't made
that attempt here, for simplicity. However, there are algorithms which
would help in assigning values to all matrix cells based on the
information of which groups they belong to, or the marginal values
associated with each cell, which decide the range of the other cells in
the same group.
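The generalisation to rxc tables can be sketched with a simple recursion over rows (an illustrative, unoptimised algorithm under the same fixed-marginals rule; not taken from the report's code):

```python
def all_tables(row_totals, col_totals):
    """Yield every non-negative integer matrix with the given fixed
    row and column totals, filling in one row at a time."""
    if len(row_totals) == 1:
        # the last row is forced by the remaining column totals
        if sum(col_totals) == row_totals[0]:
            yield (tuple(col_totals),)
        return

    def fill_row(i, left, cols, acc):
        if i == len(cols):
            if left == 0:
                for rest in all_tables(row_totals[1:], cols):
                    yield (tuple(acc),) + rest
            return
        # each cell is bounded by what is left of its row and column total
        for v in range(min(left, cols[i]) + 1):
            reduced = cols[:i] + [cols[i] - v] + cols[i + 1:]
            yield from fill_row(i + 1, left - v, reduced, acc + [v])

    yield from fill_row(0, row_totals[0], list(col_totals), [])
```

On the 2x2 example this reproduces the six tables found by hand; on the 3x3 example it enumerates every matrix consistent with the marginals, including the observed one.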

PERFORMING SAME PROCEDURE FOR 3 VARIABLES


CONTINGENCY TABLE FOR 3 VARIABLES

W2 W2'
W1 n111 n121
n112 n122
W1' n211 n221
n212 n222

FINDING ALL POSSIBLE CONTINGENCY TABLES

Here we see that the marginal values are X1, X2, Y1, Y2, Z1, Z2, where
X1=W1pp, X2=W2pp
Y1=Wp1p, Y2=Wp2p
Z1=Wpp1, Z2=Wpp2
and these are fixed. So we vary the cells so that the rule of constant
marginal values is obeyed.
Let's start with cell n111, which can take all possible values from 0 to
the maximum decided by the marginal groups to which this cell belongs.
n111 belongs to groups W1, W2 and W3; hence the upper limit on the
possible values n111 can take is decided by min(X1,Y1,Z1).
Vary n111 in the range [0..min(X1,Y1,Z1)].
For each value of n111, find the possible values of n121, where n121
belongs to (X1,Y2,Z1) and hence can take values in the range
[0..min(X1-n111, Y2, Z1-n111)],
as the bounds on X1, Y1 and Z1 change once n111 is assigned a value:
let X1'=X1-n111, Y1'=Y1-n111 and Z1'=Z1-n111.
After assigning n121 a value we update
X1'=X1'-n121, Y2'=Y2-n121 and Z1'=Z1'-n121.
For each value of n121, find the possible values of n112,
which can now take values in the range [0..min(X1',Y1',Z2)];
vary n112 in this range.
After assigning n112 a value, we immediately get n122,
as W1pp = n111+n121+n112+n122.
Update X1', Y1' and Z2' for n112, and X1', Y2' and Z2' for n122.
Now we find the possible range for n211, which is [0..min(X2,Y1',Z1')],
and update X2', Y1' and Z1' by subtracting n211 from them.
For each of these we find the range for n221, which is
[0..min(X2',Y2',Z1')], and update X2', Y2' and Z1'.
After this, n212 and n222 can easily be found from the facts
Wp1p = n111+n112+n211+n212
and Wpp2 = n112+n122+n212+n222.
For each contingency matrix thus formed, find the associated point
probability, and cumulate those which are the same as or less than the
observed point probability.
Find the coefficient of partial correlation phi,
which is
phi = SUM (P(x,y,z) - P(x)*P(y)*P(z)) / (P1++ * P2++ * P+1+ * P+2+ * P++1 * P++2)^(1/3)
The copy of the paper which describes this test is provided in the
same directory for further reference.
Example
        W2      W2'
W1      2       1      W1++ = 7
        2       2
W1'     3       1      W2++ = 9
        3       2

W+1+ = 10  W+2+ = 6   N = 16
W++1 = 7   W++2 = 9
(top values in each cell are the counts with W3 present; bottom
values are the counts with W3 absent)
For n111, the range is [0..min(7,10,7)] = [0..7]
For n121, the range is [0..min(7-n111, 6, 7-n111)]
For n112, the range is [0..min(7-n111-n121, 10-n111, 9)]
n122 = 7-n111-n121-n112
For n211, the range is [0..min(9, 10-n111-n112, 7-n111-n121)]
For n221, the range is [0..min(9-n211, 6-n121-n122, 7-n111-n121-n211)]
n212 = W+1+ - n111 - n112 - n211
n222 = W+2+ - n121 - n122 - n221
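The same search space can be checked with a brute-force sketch that treats n111, n112, n121 and n211 as the four free cells and derives the rest (illustrative only; the cell-by-cell range bounds in the text prune this space more directly):

```python
def all_2x2x2_tables(x1, y1, z1, n):
    """Enumerate all 2x2x2 tables with fixed one-way marginals
    x1 = W1++ (W1 present), y1 = W+1+ (W2 present), z1 = W++1
    (W3 present) and total n. Four cells are free; the rest follow."""
    out = []
    for n111 in range(min(x1, y1, z1) + 1):
        for n112 in range(min(x1 - n111, y1 - n111, n - z1) + 1):
            for n121 in range(min(x1 - n111 - n112, n - y1,
                                  z1 - n111) + 1):
                n122 = x1 - n111 - n112 - n121     # rest of the W1 layer
                for n211 in range(min(n - x1, y1 - n111 - n112,
                                      z1 - n111 - n121) + 1):
                    n212 = y1 - n111 - n112 - n211  # rest of the W2 layer
                    n221 = z1 - n111 - n121 - n211  # rest of the W3 layer
                    n222 = n - (n111 + n112 + n121 + n122
                                + n211 + n212 + n221)
                    if min(n122, n212, n221, n222) >= 0:
                        out.append((n111, n112, n121, n122,
                                    n211, n212, n221, n222))
    return out
```

With the example marginals (x1=7, y1=10, z1=7, n=16), every table it produces respects all three one-way marginals, and the observed table is among them.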
##############################################################################
EXPERIMENTS , OBSERVATIONS AND CONCLUSIONS
=====================
FOR BIGRAMS
=====================
===================
TOP 50 COMPARISONS
===================
With tmi.pm, the top 50 ranks obtained are as follows.
This is the output when test1.cnt was given to tmi.pm,
taken directly from test1.tmi:
228385
,<>and<>1 5137.9468 3410 15448 6652
of<>the<>2 2831.5441 2756 7919 15578
.<>The<>3 2733.2180 766 6113 785
had<>been<>4 1783.7571 468 2000 748
,<>but<>5 1515.4185 765 15448 1025
the<>enemy<>6 1300.0674 532 15578 570
on<>the<>7 1297.3587 928 1850 15578
the<>the<>8 1133.2538 1 15578 15578
,<>,<>9 1106.8880 2 15448 15448
the<>,<>10 1068.7076 11 15578 15448
.<>In<>11 849.1795 241 6113 247
o<>clock<>12 789.9850 90 90 91
General<>Grant<>13 788.2227 158 877 202
in<>the<>14 710.0651 955 3504 15578
.<>This<>15 694.0795 201 6113 210
enemy<>s<>16 682.5241 220 570 1956
did<>not<>17 636.3140 130 189 760
s<>division<>18 615.1822 177 1956 342
.<>When<>19 541.5669 152 6113 154
a<>few<>20 506.7795 140 3081 183
Court<>House<>21 506.2121 69 83 123
it<>was<>22 505.3449 252 1161 3010
of<>,<>23 502.0967 13 7919 15448
would<>be<>24 485.1509 147 572 1001
.<>He<>25 477.0930 133 6113 134
to<>be<>26 450.7241 290 6619 1001
.<>At<>27 447.2214 126 6113 128
.<>On<>28 433.4946 121 6113 122
,<>however<>29 433.1055 166 15448 169
s<>brigade<>30 432.3818 121 1956 216
,<>.<>31 427.0564 1 15448 6113
.<>As<>32 426.2314 119 6113 120
.<>It<>32 426.2314 119 6113 120
.<>the<>33 425.0891 2 6113 15578
Sixth<>Corps<>34 422.1696 65 89 189
,<>who<>35 413.9050 252 15448 415
to<>,<>36 406.7600 14 6619 15448
however<>,<>37 391.7628 159 169 15448
OF<>THE<>38 389.9416 68 159 185
United<>States<>39 388.0526 47 49 75
I<>had<>40 382.8230 245 2600 2000
;<>but<>41 379.0308 132 734 1025
could<>be<>42 371.8051 118 517 1001
,<>which<>43 368.7570 350 15448 934
I<>was<>44 360.6061 278 2600 3010
to<>the<>45 357.7431 1077 6619 15578
could<>not<>46 351.4256 106 517 760
.<>I<>47 340.4961 368 6113 2600
;<>and<>48 339.9902 217 734 6652
===============================
ANALYSIS OF THE ABOVE RESULTS
===============================
*******************
ANALYSIS 1
*******************

OBSERVATIONS

(1) Most of the collocations consist of punctuation marks,
which don't really play any special role as
collocations.
(2) Bigrams with punctuation are at the TOP???
(3) A lot of trivial collocations are over-emphasised and
given higher rankings, like
did not, had been, on the, of the, which do occur
frequently but are not collocations, as they
don't follow the non-compositionality rule.
(4) Some good collocations like United States, Court House
are ranked lower.
(5) The ratio of good collocations to the bigram sample
under consideration is very low, at about
0.06.
==========================
CONCLUSIONS FOR test1.tmi
==========================
True mutual information has an inherent drawback by which it
can't differentiate between good collocations and the trivial
collocations which occur most frequently but don't provide any
information.
This problem could be handled using a stoplist for individual
words as well as for bigrams, which would list not only the
tokens which can be ignored but also some trivial
bigrams which need not be analysed.
On the other hand, the output obtained from user2.pm shows quite
a good ranking scheme even when no stoplist is provided, which
is really surprising.
**********************
ANALYSIS 2
**********************
ON test1.user2
228385
o<>clock<>1 1571.3650 90 90 91
,<>and<>2 1425.9268 3410 15448 6652
Five<>Forks<>3 877.8387 34 35 37
had<>been<>4 796.4478 468 2000 748
Cold<>Harbor<>5 793.1611 24 24 24
Rio<>Grande<>6 778.9809 28 28 32
Court<>House<>7 772.4336 69 83 123
United<>States<>8 764.9516 47 49 75
.<>The<>9 738.5375 766 6113 785
Project<>Gutenberg<>10 724.5090 26 31 26
Front<>Royal<>11 711.9523 21 21 22
of<>the<>12 683.0092 2756 7919 15578
UNITED<>STATES<>13 480.8483 10 10 10
New<>Orleans<>14 471.7674 37 82 44
Sixth<>Corps<>15 458.8941 65 89 189
rifle<>pits<>16 437.9938 16 22 19
General<>Grant<>17 425.4898 158 877 202
PROJECT<>GUTENBERG<>18 418.1405 8 8 8
Crown<>Prince<>19 416.3935 19 20 35
Missionary<>Ridge<>20 397.1066 23 23 53
Corpus<>Christi<>21 383.1539 7 7 7
Father<>Pandoza<>21 383.1539 7 7 7
Cedar<>Creek<>22 376.0958 41 42 141
Count<>Bismarck<>23 356.5804 28 47 54
Literary<>Archive<>24 344.8794 6 6 6
OF<>THE<>25 341.1757 68 159 185
did<>not<>26 340.5760 130 189 760
Camp<>Supply<>27 334.1003 12 18 15
Black<>Kettle<>28 316.0808 7 9 7
MAJOR<>GENERAL<>29 310.4870 29 30 100
Small<>Print<>30 306.7674 6 7 6
Jules<>Favre<>30 306.7674 6 6 7
la<>Tour<>30 306.7674 6 7 6
Mars<>la<>30 306.7674 6 6 7
Piedras<>Negras<>31 302.2383 5 5 5
Yaquina<>Bay<>32 302.2195 9 9 15
White<>Oak<>33 296.8036 21 64 21
Bowling<>Green<>34 276.5141 6 6 8
CITY<>POINT<>35 273.6775 6 7 7
Beaver<>Dam<>36 269.5503 7 11 7
Yellow<>Tavern<>37 263.2275 12 16 23
le<>Duc<>38 263.0624 5 6 5
Medicine<>Lodge<>38 263.0624 5 6 5
Blue<>Ridge<>39 258.7102 17 17 53
San<>Antonio<>40 254.6795 8 15 8
widow<>Glenn<>41 253.4355 4 4 4
Des<>Chutes<>41 253.4355 4 4 4
Archive<>Foundation<>42 251.8580 6 6 9
,<>but<>43 232.3167 765 15448 1025
===========================
OBSERVATIONS ON test1.user2
===========================
(1) The results seem to be better than the results obtained by
tmi, in the sense that the overall % of trivial
collocations like had been, I was, I had is low.
(2) The overall % of collocations involving punctuation is
also low.
(3) Good collocations like United States, Court House
are given satisfactorily higher ranks.
(4) Most of the words in this output start with capital letters.
============================
CONCLUSIONS FROM test1.user2
============================
How is it possible that, when the same data is given as input to
two separate tests, one output contains mostly capitalised words
while the other contains almost all lowercase words?
Does this mean that user2 shows some bias towards
capital letters or, specifically speaking, towards those
words which start with capital letters? As we know that neither
of the programs does any case-sensitive handling,
the results are surprising: user2's output almost entirely involves
words starting with uppercase letters, while tmi.pm selects all
lowercase words.
=======================================
CONCLUSIONS ON test1.usr2 and test1.tmi
=======================================
The top 50 lists show very few bigrams in common. When these
two (very much unrelated) files were given to rank.pl,
which calculates the correlation, we confirmed this
by getting a value very close to 0, which means unrelated.
The scores generated in these top 50 lists were also observed,
and it was found that tmi.pm's output shows test scores in the range
[340, 5137], while the range of the test scores for user2
is [232, 1571], which is quite narrow compared to tmi.pm.
How to categorize the top 50 produced by each of the modules:
(1) Divide the set of bigrams into subclasses according
to their scores.
In the file we saw, typical phrases like
move back, move forward, come back, move rapidly
were found close to each other in their rankings and scores.
Also, phrases like several kinds, various points, many
kinds were found together.
The following is another example that supports this fact:
prime minister, grand jury, head quarters were found close together;
on account, rather than, as soon (part of as soon as),
by the (part of by the way) were found close to each other.
Other examples: included by, represented by, responded by,
traversed by, influenced by were all found together.
The reasoning could be that the algorithm takes into account
the possible relation between the frequencies of the typical
phrases, which could be more or less the same.
(2) Also, we observed that the ranks of proper nouns
are higher than those of any other bigrams; they are listed at the
top of the list of bigrams.
(3) Bigrams containing punctuation marks are at the bottom
of the list with low scores, which is very helpful and sound.
The cutoff point for this could be around 7700, after which
some trivial collocations and those involving punctuation marks come.
#############################################################################
=============================
RANK COMPARISIONS
=============================
COMPARISON TABLE
            ll.pm     mi.pm     dice.pm   x2.pm
tmi.pm      1.00      0.777     0.6684    0.9275
user2.pm    0.7123    0.4083    0.1927    0.5514
This table shows various comparisons made using rank.pl
for the results obtained using several tests.
We compare here the coefficient of correlation of our tests with
the standard tests already in NSP, like ll.pm, dice.pm, x2.pm and mi.pm.
The comparison table shows that
tmi is most similar to the loglikelihood test, as we get 1 there.
Next, it is most similar to the x2 test, where the coefficient is 0.9275.
The low coefficients of comparison for user2 with the other tests show
that this is a different kind of test: as we said, this test
accepts/rejects a null hypothesis based on the cumulative probability
obtained by considering all possible contingency matrices which
have point probabilities below the observed probability. Due to this
unusual way of analysing the same problem, we get low correlation
with the standard tests. However, the results in fact show very
good collocations at the top of the list, with the
trivial ones at the bottom.
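rank.pl's exact computation is not shown here, but the standard Spearman rank correlation it reports can be sketched as follows (an assumption about rank.pl; ties are ignored for simplicity):

```python
def spearman(ranks_a, ranks_b):
    """Spearman rank correlation between two rankings of the same m
    bigrams: rho = 1 - 6 * sum(d^2) / (m * (m^2 - 1)), where d is the
    rank difference of each bigram; ties are not corrected for."""
    m = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (m * (m * m - 1))
```

Identical rankings give 1, reversed rankings give -1, and unrelated rankings fall near 0, which matches the interpretation of the coefficients in the table.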
*********************************************************************************
=====================
TRIGRAM EXPERIMENTS
=====================

TOP 50 RANK

Using ll3.pm (loglikelihood)
top of the ll3 file:
228384
,<>and<>the<>1 10126.2188 280 15448 6652 15578 3410 1238 470
,<>and<>,<>2 9700.4814 68 15448 6652 15448 3410 976 90
of<>,<>and<>3 9390.5841 3 7919 15448 6652 13 212 3410
to<>,<>and<>4 9236.3700 6 6619 15448 6652 14 78 3410
,<>and<>I<>5 9158.8664 148 15448 6652 2600 3410 448 190
,<>and<>of<>6 9111.1531 15 15448 6652 7919 3410 192 23
.<>,<>and<>7 9003.4943 7 6113 15448 6652 52 42 3410
,<>and<>as<>8 8976.2952 124 15448 6652 1431 3410 241 144
,<>and<>to<>9 8846.3357 60 15448 6652 6619 3410 299 82
,<>and<>that<>10 8754.8857 99 15448 6652 2406 3410 242 109
,<>and<>then<>11 8707.3660 71 15448 6652 312 3410 78 96
however<>,<>and<>12 8667.4399 28 169 15448 6652 159 30 3410
,<>and<>in<>13 8657.3422 83 15448 6652 3504 3410 258 104
in<>,<>and<>14 8594.8857 5 3504 15448 6652 17 47 3410
I<>,<>and<>15 8508.5442 1 2600 15448 6652 17 2 3410
,<>and<>on<>16 8471.9385 62 15448 6652 1850 3410 153 74
,<>and<>when<>17 8458.3958 53 15448 6652 401 3410 93 61
,<>and<>he<>18 8456.1310 47 15448 6652 1126 3410 212 50
was<>,<>and<>19 8447.5863 1 3010 15448 6652 24 39 3410
,<>and<>was<>20 8446.6748 38 15448 6652 3010 3410 313 43
,<>and<>we<>21 8412.2439 49 15448 6652 796 3410 150 56
,<>and<>a<>22 8387.9178 54 15448 6652 3081 3410 259 115
,<>and<>with<>23 8346.7838 43 15448 6652 1592 3410 122 50
,<>and<>at<>24 8344.0516 44 15448 6652 1498 3410 147 51
,<>and<>by<>25 8338.1816 40 15448 6652 1677 3410 187 47
,<>and<>had<>26 8315.8525 29 15448 6652 2000 3410 219 34
,<>and<>it<>27 8312.4198 32 15448 6652 1161 3410 189 41
,<>and<>also<>28 8296.5899 33 15448 6652 214 3410 53 37
,<>and<>from<>29 8281.7874 32 15448 6652 1319 3410 71 41
by<>,<>and<>30 8278.4059 2 1677 15448 6652 9 24 3410
with<>,<>and<>31 8276.6025 4 1592 15448 6652 7 51 3410
s<>,<>and<>32 8275.1251 5 1956 15448 6652 26 94 3410
,<>and<>there<>33 8267.2684 33 15448 6652 428 3410 75 52
,<>and<>though<>34 8264.2430 28 15448 6652 148 3410 33 31
,<>and<>this<>35 8260.8464 32 15448 6652 983 3410 80 35
,<>and<>after<>36 8249.5074 30 15448 6652 312 3410 55 33
,<>and<>for<>37 8243.3453 16 15448 6652 1531 3410 49 22
me<>,<>and<>38 8243.1961 30 768 15448 6652 104 34 3410
cavalry<>,<>and<>39 8242.2104 22 368 15448 6652 95 24 3410
on<>,<>and<>40 8238.1017 2 1850 15448 6652 33 16 3410
from<>,<>and<>41 8225.8020 1 1319 15448 6652 6 17 3410
him<>,<>and<>42 8224.4330 27 557 15448 6652 83 29 3410
Corps<>,<>and<>43 8206.2651 19 189 15448 6652 60 20 3410
men<>,<>and<>44 8206.2261 21 427 15448 6652 90 25 3410
Creek<>,<>and<>45 8204.7301 16 141 15448 6652 58 17 3410
,<>and<>my<>46 8202.8438 27 15448 6652 1265 3410 97 44
line<>,<>and<>47 8191.8624 23 349 15448 6652 61 26 3410
road<>,<>and<>48 8189.9568 14 322 15448 6652 87 17 3410
,<>and<>his<>49 8189.0707 26 15448 6652 1315 3410 115 51
======================
OBSERVATIONS
======================
(1) There are no good collocations, as most of the trigrams
consist of punctuation and trivial phrases and conjunctions;
these results could be improved using a stopfile.
#################################################################################

CUTOFF POINTS

For test.ll3
No cutoff point was observed, as the collocations are trivial and
we obviously can't see good collocations separated from the
non-collocations. The results are not naturally categorized, which was
the case with user2.pm. It is hard from this trigram table to determine
the cutoff point, as even the trigrams at the top of the list are not
very good.
#################################################################################
REPORT

Prashant Rathi
Date 11th Oct. 2002
CS8761  Natural Language Processing

Assignment no.2
Objective :: To investigate various measures of association that can
be used to identify collocations in large corpora of text. To
identify and implement a measure that can be used with 2 and 3 word
sequences, and compare this method with some standard measures.
The experiments were performed on a large corpus. It is available at
/home/cs/rath0085/CS8761/nsp/brown.txt and contains about 1510192
bigrams and 151890 trigrams.

EXPERIMENT 1:
File No.1  tmi.pm implements true mutual information for 2 word
sequences. It is calculated as
TMI = summation( P(x,y) * log( P(x,y) / (P(x) * P(y)) ) )
It is implemented for bigrams, and the count file
generated by count.pl is passed to it. The program can be run as:
statistic.pl tmi outputfile inputfile
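The TMI sum runs over all four cells of the bigram contingency table. A minimal sketch (illustrative only, not the actual tmi.pm; the natural-log base is an assumption):

```python
from math import log

def tmi(n11, n1p, np1, npp):
    """True mutual information from NSP-style counts: n11 = joint
    count, n1p / np1 = marginal counts of the two words, npp = total
    bigrams. Sums p(x,y) * log( p(x,y) / (p(x)*p(y)) ) over the
    four cells of the 2x2 table, skipping empty cells."""
    cells = [
        (n11,                   n1p,       np1),        # both present
        (n1p - n11,             n1p,       npp - np1),  # first only
        (np1 - n11,             npp - n1p, np1),        # second only
        (npp - n1p - np1 + n11, npp - n1p, npp - np1),  # neither
    ]
    score = 0.0
    for nxy, nx, ny in cells:
        if nxy > 0:
            score += (nxy / npp) * log(nxy * npp / (nx * ny))
    return score
```

Under independence the score is 0; it grows as the joint distribution departs from the product of the marginals, in either direction.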
File No.2  user2.pm implements Yule's Coefficient of
Association for bigrams. The reference is available in the book
Multiway Contingency Tables Analysis for the Social Sciences by Thomas
D. Wickens. It is formulated using the cross-product ratio alpha, which
is the ratio of the odds in the first row (or column) to that in the
second row (or column): alpha = (N11 * N22) / (N21 * N12). Yule's
coefficient of association is given as Q = (alpha - 1)/(alpha + 1).
The program shows the same results as obtained by hand computations.
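The cross-product ratio and Q can be sketched directly (illustrative; user2.pm's handling of zero cells is not shown here):

```python
def yules_q(n11, n12, n21, n22):
    """Yule's coefficient of association for a 2x2 table, via the
    cross-product (odds) ratio alpha = (n11*n22) / (n21*n12)."""
    alpha = (n11 * n22) / (n21 * n12)
    return (alpha - 1) / (alpha + 1)
```

Q runs from -1 (perfect negative association) through 0 (independence) to +1 (perfect positive association), independent of how frequent the words are, which is why rare but tightly bound pairs can reach the top rank.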
Top 50 Comparisons:
Comparing the first 50 ranks produced in both the outputs for
tmi and user2 yields interesting results. The first few ranks produced by
tmi are:
of<>the<>1 0.0098 9123 36043 62690
.<>The<>2 0.0071 3507 49673 6921
in<>the<>3 0.0059 5265 19580 62690
.<>He<>4 0.0049 2029 49673 2991
,<>and<>5 0.0046 5376 58974 27952
.<>It<>6 0.0033 1405 49673 2185
,<>but<>7 0.0031 1615 58974 3006
on<>the<>8 0.0029 2190 6405 62690
to<>be<>8 0.0029 1606 25724 6340
United<>States<>9 0.0027 351 456 447
the<>the<>10 0.0026 3 62690 62690
...........
It is seen here that the collocations are commonly occurring ones
like "of the", "on the", and they are large in number. Such
collocations are observed in the top 50 ranks of tmi.pm. On the
other hand, the first few ranks produced by user2.pm are:
cowardly<>compromising<>1 1.0000 1 3 4
tropical<>sprue<>1 1.0000 1 11 2
Elisabeth<>Carroll<>1 1.0000 1 3 17
Nero<>Wolfe<>1 1.0000 2 3 9
aviation<>gasoline<>1 1.0000 1 2 11
polite<>minuet<>1 1.0000 1 7 2
favored<>splitting<>1 1.0000 1 18 3
powered<>radios<>1 1.0000 1 6 7
Brooks<>Atkinson<>1 1.0000 1 17 2
heroin<>addicts<>1 1.0000 1 2 4
.....
Here the collocations are more interesting, as they are uncommon
ones like "cowardly compromising"; they occur very infrequently. It
can also be seen that many collocations give the same value for this
test, so rank 1 has many entries. I think user2.pm is better than
tmi for getting interesting collocations, as tmi.pm produces ranks
highly dependent on the number of times the bigrams occur.
Cutoff Point
tmi.pm produces values for bigrams in the range from 0 to 1. It
produces very few distinct ranks (many collocations have the same
rank) compared to user2.pm. The cutoff point for tmi.pm is as shown:
with<>and<>32 0.0001 7 7007 27952
is<>one<>32 0.0001 100 9969 3073
5<>culminating<>33 0.0000 1 10611 2
the<>Sierras<>33 0.0000 2 62690 2
......
The value for all subsequent bigrams is 0, which is true for more than 90%
of bigrams. Such bigrams are very uncommon, so their probability
of occurrence nears 0.
user2.pm produces similar values for many bigrams, and the range
observed is from -1 to 1. The cutoff point observed is as shown;
below this all bigrams give negative values, which indicates that
the cross-product ratio is less than 1:
amount<>for<>9872 0.0001 1 171 8833
or<>look<>9872 0.0001 1 4127 366
the<>J56<>9873 0.0000 10 62690 241
to<>G62<>9874 -0.0001 4 25724 235
to<>G53<>9874 -0.0001 4 25724 235
......
This also means that the two words in such bigrams occur
individually a lot more than they occur together. Thus tmi
produces a steep curve, while for user2 the values decrease
gradually. This may be because the frequency count dominates tmi more
than it does user2.
Rank Comparison
tmi produces a rank correlation coefficient of -0.9493 with mi.pm, and
user2.pm produces a rank correlation coefficient of 0.7311 with the
same. This means that tmi tends to produce results opposite to mi,
while user2.pm produces results somewhat similar to mi. Here also the
corpus size matters: if the corpus is small, then tmi produces
results more similar to mi than is the case for a large corpus.
With ll, it was observed that tmi produces -0.9491 as the rank
correlation coefficient and user2 produces 0.5863. Again we
can say that user2 is more similar to ll than tmi is. These results
were taken for a large corpus; the results
may be different for another corpus.
Overall Recommendations
From the observed results, I think ll and user2 are good tests for
identifying collocations. Tmi shows questionable results when
compared with mi for a large corpus. This is because tmi produces very
few distinct ranks, i.e. the values tend to be similar, and due to the
limited range of values the rank correlation coefficient differs. It
is difficult to identify interesting collocations on the basis of
tmi values. Also, as we observed above, user2 helps us to find
interesting and infrequent collocations.

EXPERIMENT 2:
File No.3 - ll3.pm implements the log-likelihood ratio for trigrams.
The formula for the same is:
G-Square = 2 * SUM( observed * log( observed / expected ) )
The observed values can be seen by building a 2 x 2 x 2 table. The
expected value is calculated using the null hypothesis
P(x,y,z) = P(x)*P(y)*P(z). Thus the expected value in terms of
frequency counts is:
exp(x,y,z) = n(x) * n(y) * n(z) / (total trigrams)^2
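As a sketch of the computation just described, assuming the eight observed cells of the 2 x 2 x 2 table are available as a dict (the names and layout here are illustrative, not those used in ll3.pm):

```python
import math

def trigram_g2(obs, total):
    # G-Square = 2 * sum(observed * log(observed/expected)) over the
    # 8 cells of a 2x2x2 table. Index 0 in a dimension means the word
    # is present in that position, 1 means it is absent.
    g2 = 0.0
    for (i, j, k), o in obs.items():
        if o == 0:
            continue  # empty cells contribute nothing to the sum
        # marginal counts n(x), n(y), n(z) matching this cell's outcomes
        nx = sum(v for key, v in obs.items() if key[0] == i)
        ny = sum(v for key, v in obs.items() if key[1] == j)
        nz = sum(v for key, v in obs.items() if key[2] == k)
        expected = nx * ny * nz / (total * total)  # n(x)n(y)n(z) / N^2
        g2 += o * math.log(o / expected)
    return 2 * g2
```

When the observed table matches the independence model exactly, G-Square is 0; any deviation from the expected counts makes it positive.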
File No.4 - user3.pm implements Yule's Coefficient of
Association for trigrams. The extension of the test to three
dimensions is described in Ch. 9 of the book Multiway Contingency
Tables Analysis for the Social Sciences by Thomas D. Wickens. It is
formulated by extending the cross-product ratio to three
dimensions. It is given as:
alpha = (N111 * N122 * N212 * N221) / (N112 * N211 * N121 * N222)
Yule's Coefficient of Association is then given as:
Q = (alpha - 1) / (alpha + 1)
The program shows the same results as obtained by hand computations.
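A sketch of this extension, with the cells labelled as in the formula above (111 meaning all three words present, and so on); this follows the text's formulation rather than the actual user3.pm source:

```python
def yules_q_trigram(n):
    # Yule's coefficient of association for a 2x2x2 table, built from
    # the three-dimensional cross-product ratio alpha.
    # n: dict mapping cell labels '111'..'222' to (non-zero) counts.
    alpha = (n['111'] * n['122'] * n['212'] * n['221']) \
          / (n['112'] * n['211'] * n['121'] * n['222'])
    # alpha ranges over (0, inf); Q maps it into (-1, 1)
    return (alpha - 1) / (alpha + 1)
```

Because alpha is a ratio of products of counts, equal counts in every cell give alpha = 1 and hence Q = 0 (no association), and Q can never leave the interval (-1, 1).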
Top 50 Comparisons:
The following results are observed for top 50 ranks in ll3.pm.
of<>the<>fact<>41 30378.7574 19 36043 62690 446 9123 21 159
of<>the<>way<>42 30365.6054 14 36043 62690 909 9123 21 215
of<>the<>state<>43 30365.0389 33 36043 62690 603 9123 44 180
of<>the<>last<>44 30358.6645 16 36043 62690 642 9123 20 183
of<>the<>best<>45 30357.9556 19 36043 62690 349 9123 24 141
series<>of<>the<>46 30357.4622 1 122 36043 62690 76 3 9123
because<>of<>the<>47 30355.6422 56 807 36043 62690 156 59 9123
front<>of<>the<>48 30342.5881 40 226 36043 62690 97 45 9123
of<>the<>entire<>49 30338.2260 14 36043 62690 148 9123 15 97
One<>of<>the<>50 30328.0345 77 417 36043 62690 100 83 9123.......
The values are very high as can be seen in the table. The values do
not directly depend on the frequency of occurrence of the trigrams.
For user3.pm the trigrams in the first 50 are quite different from
those of ll3.pm. Here the values do not change that frequently, and
the trigrams that are seen in the first few ranks are very infrequent.
a<>6<>6<>1 1.0000 1 21994 10435 10435 10 2 4
a<>5<>4<>1 1.0000 1 21994 10611 10399 6 2 2
and<>3<>8<>1 1.0000 2 27952 10406 10326 21 3 5
the<>8<>4<>1 1.0000 1 62690 10326 10399 6 4 4
4<>7<>8<>1 1.0000 1 10399 10291 10326 2 4 3
an<>11<>3<>1 1.0000 1 3543 6866 10406 3 2 3
a<>,<>9<>1 1.0000 1 21994 58974 10010 5 2 4
............
Here ll3 seems to be the better test for identifying interesting
collocations because it produces different values for these trigrams,
which helps in distinguishing them, and also because the frequency of
the trigrams doesn't affect these values.
Cutoff Point
Both ll3 and user3 produce quite different values for the trigrams,
and the ranks produced are also different. There is no particular
point which can be called the cutoff point, above or below
which we can find interesting collocations. For ll3 the range of
values is large and the curve will be steady, whereas for user3 the
values are in the range of -1 to 1 and the curve will have flat
portions, as there are many trigrams with similar values.
Rank Comparison
The rank correlation coefficient between ll3.pm and user3.pm as
shown by rank.pl is 0.0153. This value is very close to 0, which
shows that these tests do not have much relation to each other.
This can be easily seen from the results.
Overall Recommendations
The log-likelihood test implemented for trigrams is much better than
user3 (Yule's coefficient). This is because ll3 identifies
interesting collocations very well due to the large range of
differing values it produces. It is also observed that the
frequency of trigram occurrence does not play a major role in
computing values using these tests.
The Yule's Coefficient as calculated also gives a measure of
association different from the ones already existing in NSP.
Conclusion : The test results differ depending on the corpus size.
tmi and mi show very contrasting results for a large corpus. The
log likelihood ratio shows good results when extended for trigrams.
The measure of association implemented by me shows similarity with
log likelihood more for bigrams than for trigrams.

++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
CS8761: Assignment 2 - Statistical Models for NSP (due 10/11/2002 12pm)
Sam Storie
ID 1824410
10/8/2002
+----------------------------------+
-             Methods              -
+----------------------------------+
Introduction:
The main goal of this assignment is to implement a couple of different
statistical measures as modules within the N-gram Statistics Package (NSP).
NSP is useful for counting n-gram sequences within bodies of text and then
performing some statistical analysis on the data generated. In addition to
providing 4 statistical measures, NSP provides
a modular interface for adding new statistical tests, and also provides a
program to perform Spearman's Rank Correlation Coefficient (which is useful
to show how similar two rankings of values are). Given all these tools, NSP
makes it relatively easy to observe how related different statistical
measures actually are.
All the measures included with the NSP package are designed to work with
bigram counts, but some statistical measures are able to work with trigrams
or even any n-gram count. A good example is the log-likelihood ratio, which
comes as a bigram test with NSP, but is easily expanded to analyze a count of
trigrams. This was the basis for one of our tasks in this assignment. NSP also
comes with a module to determine the "pointwise mutual information" value.
This value is interesting, but it only considers a specific case in its
computation. A possibly more meaningful value is the "true mutual information"
which enumerates over all the values in a distribution in an attempt to
consider all the counts. This statistic was the basis for another
portion of the assignment.
The tasks for this assignment included:
(a) Implementing a module to determine the "true" mutual information
(b) Finding a new statistical measure suitable for 2- and 3-word sequences
and implementing it for both cases
(c) Implementing a 3-word version of the log-likelihood ratio
For my new measure I decided to use the Kolmogorov-Smirnov test (2) for
goodness of fit between two distributions. It performs this test by
enumerating over all the samples in the distributions, and summing the
probabilities as it does so. The result is a cumulative sum of the different
probabilities after each iteration. Then this test compares the sum of one
distribution to the sum of the other and tracks the largest difference seen
at each iteration. This value, called the 'D' value, is then compared against
a standard table to determine if it has any statistical significance.
Since this test compares the expected counts to the observed counts it scales
nicely from the 2-word case to the 3-word case. The 3-word case differs from
the 2-word case only in the number of cells of counts (the 2-word case has 4
cells in the contingency table and the 3-word case has 8). Examples of the actual
computations performed with this test are included in the user2.pm and
user3.pm source files.
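The D computation described above can be sketched as follows. This is a simplified reading of the procedure; the real user2.pm/user3.pm code may organize the work differently:

```python
def ks_d(observed, expected):
    # Largest gap between the running cumulative sums of the observed
    # and expected probability distributions over the contingency cells.
    # observed, expected: parallel lists of counts, one entry per cell
    # (4 cells for the 2-word case, 8 for the 3-word case).
    total_obs = sum(observed)
    total_exp = sum(expected)
    d = cum_obs = cum_exp = 0.0
    for o, e in zip(observed, expected):
        cum_obs += o / total_obs     # running observed probability
        cum_exp += e / total_exp     # running expected probability
        d = max(d, abs(cum_obs - cum_exp))
    return d
```

The returned D value would then be checked against a critical-value table (reference 2) for statistical significance.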
Experiment one is designed to compare my true mutual information module
with my Kolmogorov-Smirnov (KS) implementation (user2.pm). Then I will compare
my modules to the log-likelihood and pointwise mutual information modules
included with NSP. Finally I will try to give a recommendation on what I feel
is the best measure for finding interesting 2-word collocations.
Experiment two is similar to the first except it's concerned with 3-word
sequences. I will compare the 3-word version of my KS module (user3.pm) with my
implementation of the 3-word log-likelihood module. As with the first
experiment, I will comment on which test I feel produces the most useful
output.
The corpus used:
I combined several files from the BROWN1 corpus supplied by Dr. Pedersen
to create a corpus of approximately 1,015,000 words. I feel this is a
sufficiently large sample to base my experiments on. Also, the BROWN corpora
strive to be representative of a wide variety of texts, which is even more
incentive towards their use. I also removed the meta-information from the text
files so that my corpus consisted of only the original text in the BROWN
corpus.
+----------------------------------+
-          Experiment One          -
+----------------------------------+
TOP 50 COMPARISON
When the top 50 bigram results produced by the true mutual information (tmi)
module are compared to the results of my KS module some interesting results
emerge. On the surface they appear to produce similar results, but upon a
closer inspection something else appears. It is true that some bigrams are
ranked high with each case, but the KS module seems to be affected more by
the marginal values than the tmi module is. The tmi module ranked some
semantically interesting bigrams rather high. "United States" for example, is
ranked 11th, but fails to appear in the top 50 KS rankings. Another bigram that
is ranked high by tmi is "New York", but again it fails to appear in the top
50 for KS. On the other hand, the KS rankings seem to favor those bigrams
containing elements that occur quite often outside the bigram. For example,
many of the top 50 bigrams in the KS rankings include a punctuation mark.
After some thought, I think this is because tmi uses the probability of the
different cells to provide some "weight" for how much influence each cell has
on the total result. Those bigrams that occur more than expected otherwise
will have a larger influence over those that are more evenly distributed
across the contingency table. With KS a different case can be made. KS is
comparing the sums of the probabilities, so those bigrams that have elements
with larger marginal values than we expect could influence the result more
than those bigrams that simply occur more than we might otherwise expect. The
result is the KS test might miss those bigrams with semantic importance, but
it still finds those bigrams with counts that deviate from what we would
expect. Also, the KS test does not discriminate against observed counts that
are higher *or* lower than expected. It only examines the deviation when it
assigns a value to the result.
Based on initial observations I would say that the tmi module finds more
interesting bigrams than the KS test. I feel this is because of the "weight"
the tmi test places on the contribution of each cell in the contingency table.
It should be stated that both tests fail miserably if a person was concerned
with bigrams that form collocations satisfying the compositional
meaning requirement. These tests are designed to find those bigrams that
deviate from what we would expect in a quantitative sense, and not a
qualitative sense.
Tables 1 and 2 (at the end of this section) show the top 50 bigram output
produced by the tmi and KS modules.
CUTOFF POINT
To determine an appropriate cutoff is difficult with these two measures
because, as I stated earlier, they don't rank based on any sort of semantic
meaning. They are simply ranked based on the frequency counts for the various
cells in the contingency table. Also, as seen from the output, the values
calculated are all fairly close to zero, so the interpretation of these
results is difficult. None of the 'D' values computed satisfy the critical
value associated with a distribution of 4 samples. There is some vague cutoff
that one could argue based on the transition from punctuation-containing
bigrams to those with semantic words. In both cases the rankings eventually
trend towards bigrams containing two words, but this doesn't really start until
around the 1000th bigram in the ranking. This is a gross approximation, but it
still helps illustrate my difficulty in determining an appropriate cutoff
value for either of these distributions.
RANK COMPARISON
These values were computed using the rank.pl program included in NSP. In
each comparison the ll.pm or mi.pm results were always listed first in the
command line for rank.pl. The results are summarized in the table below:
                 tmi.pm      user2.pm (KS module)
        +------------+--------------------------+
 ll.pm  |   -.9616   |         -.9805           |
        +------------+--------------------------+
 mi.pm  |   -.9613   |         -.9801           |
        +------------+--------------------------+
In each case the results indicate a near reverse ranking for each
comparison. This may seem very strange, but I believe there is a reason. I feel
that because neither the ll.pm module nor the mi.pm module directly considers
the number of bigrams in the entire sample space, they produce results
opposite to those of the other pair. The major factor in both the ll.pm and mi.pm
modules is the comparison between the observed and expected counts. The ll.pm
module goes one step farther and factors in the counts for each cell as it
enumerates over all of them. The important difference is that the tmi.pm and
user2.pm modules use a technique that scales these values according to the
total number of bigrams in the sample space, and I believe that will have
an inverse-like effect on the rankings for many of the bigrams. The exceptions
are those that occur so often as to remain relatively unaffected by these
skewing factors, but this is shown by the resultant comparison *not* being
-1.0, but rather some value close to it.
A rank.pl computation on the tmi.pm results with the user2.pm results shows
a value of .9999, which clearly indicates a relationship between the results
of the two tests. I feel this relationship has something to do with how they
consider the total number of bigrams in their computations.
OVERALL RECOMMENDATION
In order to make a recommendation, one must be clear on what sort of
bigrams are important. I feel that both the tmi.pm and user2.pm do a good
job of finding bigrams that contain words with some sort of relationship. The
importance of this relationship is up to the user however. If I was to look
for bigrams that form some collocation with compositional meaning, then these
tests would fail me miserably. However, if I was just searching for bigrams
with words that occur together more often than I might expect if they were
independent (represented by the "expected" distribution), then I feel these
provide a good measure for the top 100 or so rankings. I also know that the
log-likelihood test is considered to be a rather meaningful test in this
context. Regardless of the Spearman result, the top 100 rankings of the
tmi.pm, user2.pm, and ll.pm modules contain many similar bigrams. I think this
helps show the validity of both the tmi.pm and user2.pm modules, and their
usefulness when searching for bigrams that contain words occurring together
more often than we might expect.
On the contrary, the mi.pm module is very good at finding bigrams whose
words combine for some "greater" meaning. The mi.pm module does not consider
the sheer number of bigrams in its computation, so if some bigram occurs more
often than is expected from the marginal counts, it will be ranked higher.
This test does not care that, in the scheme of the entire body of text, a few
occurrences of that bigram might not be quantitatively important.
Table 1: The NSP output for the top 50 bigrams ranked by tmi

1152569
.<>The<>1 0.0142 5965 49673 6920
of<>the<>2 0.0079 9623 36043 62690
.<>He<>3 0.0070 2775 49673 2991
.<>It<>4 0.0049 1999 49673 2184
in<>the<>5 0.0047 5557 19581 62690
,<>and<>6 0.0045 6332 58974 27952
.<>In<>7 0.0040 1598 49673 1736
.<>But<>8 0.0033 1298 49673 1371
,<>but<>9 0.0032 1882 58974 3006
the<>the<>10 0.0031 3 62690 62690
United<>States<>11 0.0028 393 456 447
.<>This<>12 0.0027 1087 49673 1175
to<>be<>13 0.0025 1697 25725 6340
,<>,<>13 0.0025 61 58974 58974
on<>the<>14 0.0024 2308 6405 62690
.<>the<>14 0.0024 6 49673 62690
the<>.<>14 0.0024 15 62690 49674
.<>,<>15 0.0023 1 49673 58974
had<>been<>15 0.0023 760 5096 2470
didn<>t<>16 0.0022 392 394 2151
.<>They<>17 0.0021 847 49673 901
,<>.<>17 0.0021 57 58974 49674
don<>t<>17 0.0021 387 388 2151
.<>There<>18 0.0020 811 49673 901
.<>I<>18 0.0020 1832 49673 5877
New<>York<>18 0.0020 296 554 302
have<>been<>18 0.0020 651 3887 2470
.<>She<>18 0.0020 817 49673 895
has<>been<>19 0.0019 566 2424 2470
.<>A<>19 0.0019 1007 49673 1536
.<>.<>19 0.0019 2 49673 49674
.<>And<>19 0.0019 774 49673 853
would<>be<>20 0.0016 613 2675 6340
will<>be<>20 0.0016 592 2201 6340
of<>,<>21 0.0015 26 36043 58974
U<>S<>21 0.0015 236 324 407
.<>If<>21 0.0015 629 49673 721
did<>not<>21 0.0015 439 991 4421
.<>We<>21 0.0015 656 49673 768
may<>be<>22 0.0014 458 1286 6340
can<>be<>22 0.0014 507 1887 6340
it<>is<>22 0.0014 883 6873 9969
more<>than<>23 0.0013 388 2125 1793
It<>is<>23 0.0013 577 2184 9969
he<>had<>23 0.0013 670 6799 5096
at<>the<>23 0.0013 1508 4966 62690
.<>When<>23 0.0013 528 49673 580
the<>same<>23 0.0013 595 62690 677
of<>.<>23 0.0013 25 36043 49674
from<>the<>23 0.0013 1352 4190 62690

Table 2: The NSP output for the top 50 bigrams ranked by the KS module

1152569
of<>the<>1 0.0066 9623 36043 62690
.<>The<>2 0.0049 5965 49673 6920
,<>and<>3 0.0043 6332 58974 27952
in<>the<>4 0.0039 5557 19581 62690
the<>the<>5 0.0030 3 62690 62690
,<>,<>6 0.0026 61 58974 58974
the<>.<>7 0.0023 15 62690 49674
.<>the<>7 0.0023 6 49673 62690
.<>He<>7 0.0023 2775 49673 2991
.<>,<>8 0.0022 1 49673 58974
,<>.<>8 0.0022 57 58974 49674
.<>.<>9 0.0019 2 49673 49674
.<>It<>10 0.0017 1999 49673 2184
to<>the<>10 0.0017 3402 25725 62690
on<>the<>10 0.0017 2308 6405 62690
of<>,<>11 0.0016 26 36043 58974
,<>but<>12 0.0015 1882 58974 3006
.<>I<>13 0.0014 1832 49673 5877
.<>In<>14 0.0013 1598 49673 1736
to<>be<>14 0.0013 1697 25725 6340
of<>.<>14 0.0013 25 36043 49674
,<>of<>15 0.0012 418 58974 36043
.<>But<>16 0.0011 1298 49673 1371
for<>the<>16 0.0011 1756 8833 62690
at<>the<>16 0.0011 1508 4966 62690
to<>,<>16 0.0011 42 25725 58974
a<>,<>17 0.0010 8 21998 58974
a<>the<>17 0.0010 2 21998 62690
,<>he<>17 0.0010 1523 58974 6799
from<>the<>17 0.0010 1352 4190 62690
.<>and<>17 0.0010 4 49673 27952
with<>the<>17 0.0010 1477 7007 62690
and<>,<>17 0.0010 304 27952 58974
and<>.<>17 0.0010 3 27952 49674
the<>a<>17 0.0010 1 62690 21998
the<>in<>18 0.0009 5 62690 19581
.<>This<>18 0.0009 1087 49673 1175
to<>.<>18 0.0009 47 25725 49674
by<>the<>18 0.0009 1316 5124 62690
in<>a<>19 0.0008 1318 19581 21998
in<>,<>19 0.0008 73 19581 58974
of<>and<>19 0.0008 9 36043 27952
a<>.<>19 0.0008 15 21998 49674
.<>a<>19 0.0008 4 49673 21998
.<>A<>19 0.0008 1007 49673 1536
of<>a<>20 0.0007 1467 36043 21998
.<>She<>20 0.0007 817 49673 895
as<>a<>20 0.0007 909 6691 21998
and<>of<>20 0.0007 111 27952 36043
it<>is<>20 0.0007 883 6873 9969

+----------------------------------+
-          Experiment Two          -
+----------------------------------+
TOP 50 COMPARISON
Upon viewing the top 50 rankings for the first time I was rather surprised
to see the results I obtained. The rankings were clearly dependent on the
first two words in the trigrams, and this produced fairly different ranks for
the trigrams. The top 50 rankings produced by the ll3 module are dominated by
the ".<>The<>*" trigram (where * is meant to represent any word). The user3
module rankings were dominated by the "of<>the<>*" trigram. The actual output
has been included at the end of this section as table(s) 3 & 4.
I think this isn't so much a flaw in the modules, but rather a tendency of
trigrams. The ll3 module is not much different from the 2-word version,
and it bases its final value on the sheer number of times the trigram's
components occur. It's easy to see from the output that the bigram ".<>The"
occurs very often, with a relatively high count of 49673. In fact the output
is essentially the 2-word ranking, but with a third word appended to each
bigram. Due to space concerns I can not show this in this report, but I have
checked it and it is the case. It is the same situation for the user3.pm
module. It ranked the words in the same general order as the 2-word version,
but appended a third word to each bigram. This seems very strange; the
calculated values do differ between the trigrams in the ll3 rankings, but the
precision isn't large enough to show a big difference in the user3 rankings.
CUTOFF POINT
Similar to the 2-word case, I feel determining a cutoff value for the
trigram rankings is very difficult. It seems as though the trigram rankings
are merely an expansion of the bigram rankings, so for the same reasons a
cutoff value would not mean much. One difference with the trigram rankings,
however, is that a semantic meaning starts to emerge for the trigrams that
isn't as prevalent with the bigrams. This is probably due to a trigram being
able to convey more meaning with its three words than a bigram. Also, this
ranking is still based on sheer frequency counts over meaning, so any cutoff
falls prey to the wishes of the person assigning it.
OVERALL RECOMMENDATION
Again, similar to the bigram case, in order to make a recommendation it
must be known what the user is looking for. With the absence of the pointwise
mutual information module, they are essentially stuck with a ranking based
on frequency (and not meaning). The strange correlation between the bigram
and trigram cases makes me wonder if there's either an error in my
mathematical procedure, or if these two tests do not provide as meaningful
information for trigrams as they do bigrams. In either case, I feel they both
provide some meaningful information and do find trigrams that occur more
frequently than a person might expect. However, if the goal was to find
trigrams with a higher semantic meaning, then I think a person would be better
off using a different statistic.
Table 3: The top 50 rankings produced by the ll3 module

1152568
.<>The<>would<>1 32790.1157 2 49673 6920 2675 5965 291 2
.<>The<>same<>2 32774.7950 28 49673 6920 677 5965 32 31
.<>The<>New<>3 32773.2097 14 49673 6920 554 5965 27 29
.<>The<>will<>4 32772.9476 1 49673 6920 2201 5965 196 1
.<>The<>United<>5 32772.5431 14 49673 6920 456 5965 15 17
.<>The<>I<>6 32772.2869 2 49673 6920 5877 5965 341 2
.<>The<>American<>7 32769.1592 18 49673 6920 591 5965 22 21
.<>The<>Man<>8 32768.3984 1 49673 6920 64 5965 1 7
.<>The<>Prince<>9 32763.9745 3 49673 6920 33 5965 3 10
.<>The<>Lord<>10 32762.9355 1 49673 6920 98 5965 2 6
.<>The<>world<>11 32762.5045 11 49673 6920 739 5965 18 13
.<>The<>Commission<>12 32761.6256 3 49673 6920 78 5965 4 9
.<>The<>Walnut<>13 32761.1441 1 49673 6920 8 5965 1 6
.<>The<>addition<>14 32760.9312 1 49673 6920 138 5965 80 1
.<>The<>party<>15 32760.5849 4 49673 6920 207 5965 4 4
.<>The<>Government<>16 32760.4715 10 49673 6920 157 5965 10 15
.<>The<>city<>17 32759.9822 6 49673 6920 290 5965 7 8
.<>The<>National<>18 32759.5559 6 49673 6920 160 5965 6 8
.<>The<>children<>19 32759.2380 10 49673 6920 359 5965 14 10
.<>The<>door<>20 32758.8067 5 49673 6920 321 5965 6 6
.<>The<>Space<>21 32758.6269 1 49673 6920 13 5965 1 5
.<>The<>work<>22 32758.1267 15 49673 6920 758 5965 31 15
.<>The<>great<>23 32757.8663 19 49673 6920 615 5965 29 20
.<>The<>Holy<>24 32757.7263 2 49673 6920 32 5965 2 6
.<>The<>state<>25 32757.5994 9 49673 6920 603 5965 18 13
.<>The<>place<>26 32757.4519 10 49673 6920 545 5965 17 11
.<>The<>higher<>27 32757.1897 3 49673 6920 154 5965 3 3
.<>The<>President<>28 32757.1333 21 49673 6920 290 5965 22 28
.<>The<>sun<>29 32757.0760 7 49673 6920 117 5965 7 7
.<>The<>answer<>30 32756.9679 4 49673 6920 149 5965 10 10
.<>The<>Soviet<>31 32756.8918 4 49673 6920 138 5965 4 5
.<>The<>spirit<>32 32756.7985 3 49673 6920 162 5965 3 4
.<>The<>book<>33 32756.7591 6 49673 6920 184 5965 7 6
.<>The<>present<>34 32756.7460 8 49673 6920 396 5965 17 14
.<>The<>City<>35 32756.6942 4 49673 6920 134 5965 4 5
.<>The<>job<>36 32756.6888 4 49673 6920 240 5965 5 4
.<>The<>Carleton<>37 32756.6658 1 49673 6920 28 5965 1 4
.<>The<>room<>38 32756.5881 6 49673 6920 380 5965 10 6
.<>The<>various<>39 32756.5876 5 49673 6920 195 5965 6 5
.<>The<>very<>40 32756.5023 8 49673 6920 772 5965 20 8
.<>The<>direction<>41 32756.4817 3 49673 6920 132 5965 3 3
.<>The<>total<>42 32756.4491 9 49673 6920 208 5965 10 10
.<>The<>bed<>43 32756.4458 3 49673 6920 131 5965 3 3
.<>The<>water<>44 32756.4428 7 49673 6920 452 5965 13 7
.<>The<>list<>45 32756.4096 3 49673 6920 130 5965 3 3
.<>The<>meaning<>46 32756.2988 3 49673 6920 127 5965 3 3
.<>The<>ball<>47 32756.2627 4 49673 6920 107 5965 4 4
.<>The<>physical<>48 32756.2611 3 49673 6920 126 5965 3 3
.<>The<>audience<>49 32756.1933 8 49673 6920 115 5965 8 8
.<>The<>principle<>50 32756.1886 2 49673 6920 107 5965 2 4

Table 4: The top 50 rankings produced by the user3 module

1152568
of<>the<>.<>1 0.0067 2 36043 62690 49674 9623 2086 15
of<>the<>next<>1 0.0067 4 36043 62690 362 9623 5 185
of<>the<>end<>1 0.0067 3 36043 62690 417 9623 10 204
of<>the<>only<>1 0.0067 1 36043 62690 1642 9623 11 204
of<>the<>corrosive<>2 0.0066 1 36043 62690 4 9623 1 1
of<>the<>Walkers<>2 0.0066 1 36043 62690 1 9623 1 1
of<>the<>gain<>2 0.0066 3 36043 62690 75 9623 5 6
of<>the<>denominational<>2 0.0066 2 36043 62690 9 9623 4 4
of<>the<>24<>2 0.0066 1 36043 62690 74 9623 1 4
of<>the<>cathedrals<>2 0.0066 1 36043 62690 3 9623 1 2
of<>the<>Lewis<>2 0.0066 1 36043 62690 63 9623 3 3
of<>the<>fortune<>2 0.0066 1 36043 62690 25 9623 3 3
of<>the<>nineteenth<>2 0.0066 14 36043 62690 54 9623 17 31
of<>the<>21<>2 0.0066 5 36043 62690 63 9623 6 8
of<>the<>stall<>2 0.0066 4 36043 62690 17 9623 4 9
of<>the<>pictures<>2 0.0066 3 36043 62690 63 9623 6 9
of<>the<>methods<>2 0.0066 1 36043 62690 131 9623 4 22
of<>the<>20<>2 0.0066 1 36043 62690 148 9623 6 4
of<>the<>destruction<>2 0.0066 2 36043 62690 40 9623 3 12
of<>the<>stag<>2 0.0066 1 36043 62690 8 9623 1 2
of<>the<>destructive<>2 0.0066 2 36043 62690 28 9623 2 4
of<>the<>greater<>2 0.0066 1 36043 62690 180 9623 9 24
of<>the<>muscles<>2 0.0066 1 36043 62690 31 9623 2 5
of<>the<>Force<>2 0.0066 1 36043 62690 31 9623 1 2
of<>the<>basement<>2 0.0066 1 36043 62690 30 9623 1 15
of<>the<>establishment<>2 0.0066 1 36043 62690 52 9623 1 26
of<>the<>stem<>2 0.0066 3 36043 62690 30 9623 3 6
of<>the<>quantum<>2 0.0066 1 36043 62690 5 9623 1 2
of<>the<>17<>2 0.0066 1 36043 62690 53 9623 1 2
of<>the<>Old<>2 0.0066 12 36043 62690 93 9623 13 38
of<>the<>celebration<>2 0.0066 1 36043 62690 14 9623 1 2
of<>the<>regiment<>2 0.0066 2 36043 62690 22 9623 2 10
of<>the<>claim<>2 0.0066 3 36043 62690 97 9623 7 10
of<>the<>views<>2 0.0066 1 36043 62690 49 9623 3 7
of<>the<>13<>2 0.0066 2 36043 62690 65 9623 4 5
of<>the<>12<>2 0.0066 1 36043 62690 146 9623 9 12
of<>the<>David<>2 0.0066 1 36043 62690 55 9623 4 2
of<>the<>mushroom<>2 0.0066 1 36043 62690 2 9623 1 1
of<>the<>white<>2 0.0066 5 36043 62690 273 9623 11 31
of<>the<>11<>2 0.0066 1 36043 62690 96 9623 3 5
of<>the<>10<>2 0.0066 3 36043 62690 251 9623 9 14
of<>the<>Criminal<>2 0.0066 2 36043 62690 4 9623 2 3
of<>the<>unimproved<>2 0.0066 1 36043 62690 2 9623 1 1
of<>the<>vehicle<>2 0.0066 1 36043 62690 33 9623 3 5
of<>the<>unification<>2 0.0066 1 36043 62690 9 9623 1 4
of<>the<>really<>2 0.0066 1 36043 62690 267 9623 5 7
of<>the<>Germans<>2 0.0066 2 36043 62690 27 9623 2 15
of<>the<>Eight<>2 0.0066 1 36043 62690 11 9623 1 1
of<>the<>Iliad<>2 0.0066 6 36043 62690 14 9623 6 13
of<>the<>parent<>2 0.0066 1 36043 62690 17 9623 3 3

+----------------------------------+
-  References and More Information -
+----------------------------------+
(1) N-gram Statistics Package
    http://www.d.umn.edu/~tpederse/nsp.html
(2) 100 Statistical Tests by Gopal K. Kanji (ISBN: 0803987048)
    The Kolmogorov-Smirnov test is discussed on page 67
    A table of 'D' values is found on page 186
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
\\\///
\\ ~ ~ //
( @ @ )
**********************************oOOo(_)oOOo************************************
CS 8761  Natural Language Processing  Dr. Ted Pedersen
Assignment 2 : THE GREAT COLLOCATION HUNT
Due : Friday , October 11 , Noon
Author : Anand Takale ( )
***************************************OoooO***************************************
.oooO ( )
( ) ) /
\ ( (_/
\_)
Objective : To investigate various measures of association that can be used to identify
 collocations in large corpora of text. In particular, identify and implement
a measure that can be used with 2 and 3 word sequences, and compare this
method with some other standard measures (i.e. to identify a measure that
is suitable for both 2 and 3 word sequences and that is not already a part of NSP).
Introduction : For this assignment we have to implement "true" mutual information for 2 word
 sequences. Then we have to identify a measure that is not currently supported by
the NSP package. This measure should work for 2 as well as 3 word sequences.
We also have to implement the log likelihood ratio test for 3 word sequences.
After all the tests have been implemented we have to compare the 50 top bigrams
produced by each test. Also we have to find out whether any cutoff point for
scores occurs in the results. Then we have to compare the tests using the
rank.pl module of the NSP package.
CORPUS of text used : vikram.txt , crusoe.txt , donq.txt , holmes.txt , hamlet.txt , nafta.txt
 crime.txt , twist.txt , wstra10.txt
( Project Gutenberg  http://promo.net/pg/ )
All the files were placed in a directory called corpus and the tests were run
on this directory.

Experiment 1

1. Implement "true" mutual information for 2 word sequences. (tmi.pm)
2. Identify a measure that is currently not supported by NSP that is suitable for discovering
2 and 3 word collocations. (user2.pm)

Part 1 : Implementing "true" mutual information for 2 word sequences. ( Module tmi.pm )

"True" mutual information is a measurement of dependence between X and Y.
"True" mutual information is given by the formula :
I(X:Y) = (SUM X)(SUM Y) p(x,y) log [ p(x,y)/ (p(x)*p(y)) ]
where p(x,y) is the probability that the word x is the first token of the bigram and word y is the
second token of the bigram,
p(x) is the probability that x is the first token of the bigram
p(y) is the probability that y is the second token of the bigram
The following values are obtained as the output of count.pl program.
(1) The number of times both x & y occured together with y after x  f(x,y)
(2) The number of times x occured as the first token of the bigram  f(x)
(3) The number of times y occured as the second token of the bigram  f(y)
Once you have the frequencies you can biuld the 2 way contingency table as follows :
The 2-way contingency table is as follows (~x denotes any first token
other than x, and ~y any second token other than y):

                    y           ~y
              +-----------+-----------+
       x      |  f(x,y)   |  f(x,~y)  |   (total x)
              +-----------+-----------+
      ~x      |  f(~x,y)  |  f(~x,~y) |   (total ~x)
              +-----------+-----------+
               (total y)   (total ~y)    (total no. of bigrams)
We have the value of the total no. of bigrams and the values for xy, total x and total y directly from
the output of count.pl. The rest of the values can be calculated and the table can be formulated. After
the table has been formulated there are four different cell values available:
xy , ~xy , x~y , ~x~y
We also get the total number of the bigrams that occurred in the input text (i.e. the corpus),
so we can find the probabilities as follows :
p(x,y) = f(x,y) / totalBigrams
p(x) = f(x) / totalBigrams
p(y) = f(y) / totalBigrams
Taking a summation of all the four values the "TRUE" mutual information for that bigram is calculated.
After the "TRUE" mutual information has been calculated for each and every bigram the output of
statistic.pl is compared with other outputs of statistic.pl obtained by using different measures of
associations. The output files obtained from statistic.pl are compared using rank.pl.
Rank.pl will output a floating point number between -1 and 1. A return of '1' indicates a perfect
match in rankings, '-1' a completely reversed ranking, and '0' a pair of rankings that are completely
unrelated to each other. Numbers that lie between these values indicate various degrees of
relatedness / unrelatedness / reverse-relatedness.
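The coefficient reported by rank.pl behaves like Spearman's rank correlation. The following is a minimal Python sketch of that computation (an assumption about rank.pl's internals; the NSP source is authoritative), ignoring tied scores for simplicity:

```python
def spearman(xs, ys):
    """Spearman rank correlation of two parallel score lists (no ties)."""
    def ranks(vs):
        # rank 1 = highest score, as in the statistic.pl output
        order = sorted(range(len(vs)), key=lambda i: -vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

print(spearman([3.0, 2.0, 1.0], [3.0, 2.0, 1.0]))  # identical rankings -> 1.0
print(spearman([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))  # reversed rankings -> -1.0
```

Identical rankings give 1, fully reversed rankings give -1, matching the behaviour described above.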
Sample Calculation of TMI for a bigram :

The sample calculation for the bigram with the highest true mutual information score is as follows:
1425725
,<>and<>1 0.0199 17218 103199 42199
We can see that the total number of bigrams occurring in the corpus is 1425725.
The bigram with tokens (,) and (and) has the highest score of 0.0199.
We also see that
f(x,y) = 17218
f(x) = 103199
f(y) = 42199
From these values we can calculate the remaining cell values.
Filling the 2*2 table we get:

                 y           ~y
      x       17218       85981    |  103199
     ~x       24981     1297545    | 1322526
              42199     1383526    | 1425725
From this table we can compute the TRUE mutual information of the bigram.
The TRUE mutual information is calculated by the following formula:
TMI = SUM over x,y [ p(x,y) * log ( p(x,y) / (p(x)*p(y)) ) ]
Here the summation over x and y means taking all four cell combinations and adding them up:
f(x,y), f(x,~y), f(~x,y), f(~x,~y)
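As a check on the arithmetic, here is a minimal Python sketch (an illustration, not the NSP Perl module itself) that derives the 2*2 cell counts from the count.pl values and sums the four TMI terms. The sample score of 0.0199 above is reproduced when the logarithm is taken base 2, which this sketch therefore assumes:

```python
from math import log2

def tmi(fxy, fx, fy, n):
    """True mutual information of a bigram from count.pl values.

    fxy = f(x,y), fx = f(x), fy = f(y), n = total number of bigrams.
    """
    # the four cells of the 2*2 contingency table, each with its
    # row and column marginal counts
    cells = [
        (fxy,               fx,     fy),      # x followed by y
        (fx - fxy,          fx,     n - fy),  # x followed by not-y
        (fy - fxy,          n - fx, fy),      # not-x followed by y
        (n - fx - fy + fxy, n - fx, n - fy),  # not-x followed by not-y
    ]
    total = 0.0
    for nij, ni, nj in cells:
        if nij > 0:  # treat 0 * log(0) as 0
            pij = nij / n
            total += pij * log2(pij / ((ni / n) * (nj / n)))
    return total

# the ',<>and<>' bigram from the sample calculation above
print(round(tmi(17218, 103199, 42199, 1425725), 4))  # -> 0.0199
```

When the observed cell counts equal the counts expected under independence, every term is zero and the score is 0.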
After running count.pl on the CORPUS, the file of bigrams called corpus.cnt was created.
Then the scores of the different bigrams were calculated by running statistic.pl with tmi.pm
as the statistic measure.
TOP 50 RANKS :

Following are the top 15 rankings produced by the tmi.pm statistic:
,<>and<> 1 0.0199 17218 103199 42199
Don<>Quixote<>2 0.0150 2182 2792 2189
,<>,<> 3 0.0081 1 103199 103199
of<>the<> 4 0.0080 8026 35692 61040
.<>The<> 5 0.0078 2827 49770 3634
;<>and<> 6 0.0065 4199 14797 42199
.<>He<> 7 0.0062 2136 49770 2533
I<>am<> 8 0.0055 1450 21441 1679
don<>t<> 9 0.0051 747 747 2127
Mr<>.<> 10 0.0050 1467 1469 49770
in<>the<> 11 0.0048 4699 19621 61040
the<>,<> 12 0.0047 1 61040 103199
;<>but<> 13 0.0042 1597 14797 5276
.<>It<> 14 0.0041 1427 49770 1731
.<>But<> 15 0.0039 1350 49770 1625
We can see that all of the top 15 rankings have distinct scores. We can also observe that
most of the top ranks produced by the tmi test contain punctuation marks as tokens.
So we can conclude that the tmi test doesn't produce interesting collocations; i.e., looking at the
top ranked bigrams we can't say much about the collocations, since we get little idea about which
two words occur together the most. Although the words Don Quixote come together often, they are not
interesting as they form the name of a person. Some genuine collocations are present in the top ranks:
collocations like 'I am' and 'in the' occur, but at a considerably lower rank.
CUTOFF POINT :

Looking at corpus.tmi we can observe that the highest ranked bigrams have unique scores.
When the ranks go beyond about 30 there is more than one bigram with the same rank. Hence we can
pinpoint the cutoff point at around ranks 25 to 30. We can also observe from these readings that
there are very few bigrams that occur many times and many bigrams that appear very
few times. This is in accordance with Zipf's Law.

Part 2: Identify a measure currently not supported by the NSP package. The measure chosen was
the JT test. The JT test is a variation of the U test. It is a popular test proposed by
Terpstra (1952) and Jonckheere (1954) independently of each other. The JT test uses a
Mann-Whitney statistic Uij for the two sample problem comparing samples i and j.
The test statistic is constructed by adding all the Uij's. Thus the test statistic is
JT coeff = U12 + U13 + ... + U1k + U23 + U24 + ... + U2k + ... + U(k-1)k
The Mann-Whitney U test is based on the idea that the particular pattern exhibited when the
X and Y random variables are arranged together in increasing order of magnitude provides
information about the relationship between their populations.
The Mann-Whitney U statistic is defined as the number of times a Y precedes an X in the combined
ordered arrangement of the two independent random samples.
The formula used here for calculating the U statistic is as follows:
U = [ P(x,y) - p ] / sqrt( p*(1-p)/n )
where P(x,y) is the probability that x and y occur together,
and p = P(x) * P(y), i.e. the product of the individual probabilities.
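The formula above can be sketched in Python as follows (a toy illustration with hypothetical helper names, not the actual user2.pm code), where n is the total number of bigrams in the corpus:

```python
from math import sqrt

def u_stat(fxy, fx, fy, n):
    """U = (P(x,y) - p) / sqrt(p*(1-p)/n), with p = P(x)*P(y)."""
    pxy = fxy / n            # observed probability that x and y occur together
    p = (fx / n) * (fy / n)  # expected probability under independence
    return (pxy - p) / sqrt(p * (1 - p) / n)

# toy example: a bigram seen 10 times whose tokens occur 50 and 40 times
# in a corpus of 1000 bigrams
print(round(u_stat(10, 50, 40, 1000), 2))  # -> 5.66
```

A large positive U means the pair occurs together far more often than independence would predict, so the null hypothesis can be rejected.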
In the JT test, first the 2*2 table was built, the same as in the TRUE mutual information module.
The U values were calculated from this table and were added to get the JT coefficient.
After running statistic.pl on the corpus with user2.pm as the statistic measure, the following
bigrams were found to have the highest ranks. This test ranks the bigrams on the probability
of both tokens in the bigram occurring together. We saw that in tmi.pm the bigram which occurred
the most number of times was ranked first. However, in this case the bigram with the highest rank occurred
only once. The JT test gives a coefficient which tells us whether the probability of the two
tokens occurring together is high or low. So we see that the probability of the tokens occurring
together is 100% in the first two ranks.
Plaza<>Mayor<>1 1702365800.3925 1 2 2
ISO<>IEC<>1 1702365800.3925 1 2 2
Also we see that most of the bigrams whose tokens appear together occur very few times. So again
Zipf's Law holds: most of the bigrams occur only once whereas only a few bigrams occur many times.
We also conclude that the JT test and TMI are very different from each other. TMI concentrates
on the number of occurrences while JT concentrates more on the probability of the tokens in a bigram occurring together.
This is also shown by running the two tests through rank.pl. The coefficient obtained is -0.9037, which confirms
that the TMI and the JT tests are different from each other.
RANK COMPARISONS:

The rank correlation of the tmi test with the other tests is as follows:
(1) Pointwise Mutual Information (mi.pm) : 0.9031
(2) Log Likelihood (ll.pm) : 0.9024
The rank correlation of the JT test with the other tests is as follows:
(1) Pointwise Mutual Information (mi.pm) : 0.6582
(2) Log Likelihood (ll.pm) : 0.3138
OVERALL RECOMMENDATION:
We can see that to get interesting collocations the True Mutual Information test would be more
useful if it were run on a corpus without punctuation marks. However, in True Mutual Information,
a bigram that occurred 2 times whose tokens appeared 100 times each and a bigram that occurred 2
times whose tokens appeared 2 times each receive nearly the same score. Thus if we want to see which collocations
have the maximum probability of appearing together then we should go for the JT test.




EXPERIMENT 2:

1. Implement a module named ll3.pm that performs the log likelihood ratio test for 3 word
sequences. Assume that the null hypothesis of the test is as follows :
p(w1).p(w2).p(w3) = p(w1,w2,w3)
2. Create your 3 word version of user2.pm and call it user3.pm
In the log-likelihood test for three word sequences there was a need to build the 3-way contingency table
first. This contingency table was built from the marginal counts using De Morgan's laws.
The formula used for the log-likelihood test is as follows:
log likelihood ratio = 2 * SUM [ observed * log ( observed / estimated ) ]
where observed = f(x,y,z)
and estimated = ( f(x)*f(y)*f(z) ) / (n*n)
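A minimal Python sketch of the single-cell form of this test as stated above (an illustration, not the actual ll3.pm source), where n is the total number of trigrams:

```python
from math import log

def ll3(fxyz, fx, fy, fz, n):
    """Log likelihood ratio for a trigram under
    H0: p(w1,w2,w3) = p(w1)*p(w2)*p(w3)."""
    # expected trigram count under the null hypothesis:
    # n * (fx/n) * (fy/n) * (fz/n) = fx*fy*fz / n^2
    estimated = fx * fy * fz / (n * n)
    if fxyz == 0:
        return 0.0  # treat 0 * log(0) as 0
    return 2.0 * fxyz * log(fxyz / estimated)

# toy example: a trigram seen once, each word seen 10 times,
# in a corpus of 100 trigrams
print(round(ll3(1, 10, 10, 10, 100), 3))  # -> 4.605
```

When the observed count equals the expected count the ratio is 1, the log is 0, and the score is 0, so large scores signal a departure from the null hypothesis.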
count.pl was run on the corpus to produce corpus3.cnt for trigrams.
Then statistic.pl was run with both user3.pm and ll3.pm as the statistic measure.
The scores obtained were compared by running rank.pl on corpus.ll3 and corpus.user3.
The rank correlation of user3.pm and ll3.pm was 0.2637, showing that the two tests are very different from each other.
Top 50 ranks:

Again in this case the top 50 ranks were different because the JT test takes into consideration the probability
that all three tokens in the trigram appear together and do not appear elsewhere. That the
tests are very different from each other is confirmed by the rank coefficient obtained.
Cutoff point :

The same cutoff point was observed in the case of user3.pm as in that of user2.pm.
In the case of log-likelihood there were many unique scores at the beginning, i.e. at the higher ranks,
whereas many trigrams shared the same rank at lower positions.
Overall Recommendations :

The JT test provides a way to tell you whether the tokens in an n-gram occur only in that particular
n-gram and nowhere else in the corpus. However, the JT test does not output interesting collocations.

References :

1. README and Newstats files of the NSP package on Dr. Ted Pedersen's web page
http://www.d.umn.edu/~tpederse
2. CORPUS - taken from Project Gutenberg
http://promo.net/pg/
3. Foundations of Statistical Natural Language Processing
- Christopher Manning and Hinrich Schutze
4. Nonparametric Statistical Inference - formula for the JT test
- Jean Dickinson Gibbons, Subhabrata Chakraborti (Chapters 7 and 11)
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
*****************
Name: Nan Zhang *
Date: 10/10/02 *
Class: CS8761 *
*****************
******************
* *
* Experiment 1 *
* *
******************
*****************************************************************************************************************************
TOP 50 COMPARISON:
*****************************************************************************************************************************
The Corpus I use is <>.
Top 50 by tmi.pm
.<>D<>31056 1163809327.4667 310 11718 367
with<>a<>31057 1203381402.8231 295 2049 4561
him<>,<>31058 1210841293.8821 263 1470 21495
M<>.<>31059 1220667819.1382 331 331 11719
by<>the<>31060 1224848032.7068 290 1055 12566
then<>,<>31061 1263433879.9635 299 639 21495
me<>,<>31062 1318917540.1362 291 1356 21495
said<>the<>31063 1336223882.2826 300 1960 12566
on<>the<>31064 1344955969.0981 325 952 12566
I<>am<>31065 1363879681.2058 417 3849 440
,<>to<>31066 1374335186.5698 265 21495 6501
,<>which<>31067 1380125280.8556 303 21495 1495
you<>,<>31068 1396187976.1075 285 3370 21495
he<>had<>31069 1398364320.9748 374 2689 1856
with<>the<>31070 1398432295.5183 314 2049 12566
.<>But<>31071 1420887734.1110 380 11718 433
.<>You<>31072 1436573948.3186 376 11718 527
.<>And<>31073 1471446439.4109 389 11718 495
,<>then<>31074 1506154159.7267 363 21495 639
to<>be<>31075 1542982342.1917 391 6501 1359
,<>with<>31076 1545714947.3050 333 21495 2049
of<>a<>31077 1551260754.0983 348 6321 4561
Athos<>,<>31078 1556033698.3020 361 957 21495
I<>have<>31079 1584173054.4934 424 3849 1460
of<>his<>31080 1625082599.1097 383 6321 2910
the<>cardinal<>31081 1673144878.2925 443 12566 518
,<>but<>31082 1732749001.5084 397 21495 1208
.<>He<>31083 1745646036.6340 468 11718 521
and<>the<>31084 1754961703.7808 368 5350 12566
,<>you<>31085 1800977181.3863 376 21495 3370
,<>in<>31086 1847508856.0975 389 21495 3146
,<>that<>31087 1851306368.0374 389 21495 3225
;<>and<>31088 1861098232.2333 454 2829 5350
,<>my<>31089 1909670559.2590 435 21495 1412
;<>but<>31090 1994045442.5270 584 2829 1208
,<>as<>31091 2037463998.2530 462 21495 1578
at<>the<>31092 2046655568.0214 493 1493 12566
,<>who<>31093 2144662094.9485 509 21495 1055
,<>he<>31094 2290124564.3685 499 21495 2688
Artagnan<>,<>31095 2515455299.3788 574 1827 21495
,<>the<>31096 2889357459.5828 561 21495 12566
.<>I<>31097 2927902472.6048 669 11718 3849
.<>The<>31098 3213055003.6713 859 11718 982
in<>the<>31099 3368643268.6223 791 3146 12566
to<>the<>31100 3415678714.8315 748 6501 12566
,<>I<>31101 3735454005.1049 824 21495 3849
d<>Artagnan<>31102 4344794727.8653 1466 1494 1827
,<>said<>31103 5114710934.5145 1246 21495 1960
of<>the<>31104 6867623309.2519 1615 6321 12566
,<>and<>31105 12470406843.4868 3002 21495 5350
Top 50 by user2.pm
!<>a<>10700 0.9402 1 1896 4561
have<>.<>10701 0.9410 2 1460 11719
and<>,<>10702 0.9417 14 5350 21495
,<>s<>10703 0.9420 2 21495 783
you<>he<>10704 0.9427 1 3370 2688
his<>in<>10705 0.9433 1 2910 3146
of<>at<>10706 0.9453 1 6321 1493
you<>his<>10707 0.9470 1 3370 2910
,<>But<>10708 0.9474 1 21495 433
had<>and<>10709 0.9479 1 1856 5350
of<>in<>10710 0.9483 2 6321 3146
for<>to<>10711 0.9501 1 1593 6501
to<>for<>10711 0.9501 1 6501 1593
from<>.<>10712 0.9505 1 877 11719
?<>and<>10713 0.9506 1 1960 5350
,<>know<>10714 0.9509 1 21495 465
with<>and<>10715 0.9527 1 2049 5350
of<>.<>10716 0.9529 7 6321 11719
at<>,<>10717 0.9543 3 1493 21495
or<>,<>10718 0.9549 1 507 21495
,<>He<>10719 0.9561 1 21495 521
in<>I<>10720 0.9570 1 3146 3849
!<>the<>10721 0.9575 2 1896 12566
.<>will<>10722 0.9583 1 11718 1043
by<>.<>10723 0.9588 1 1055 11719
his<>,<>10724 0.9610 5 2910 21495
to<>with<>10725 0.9611 1 6501 2049
we<>,<>10726 0.9636 1 631 21495
in<>and<>10727 0.9691 1 3146 5350
.<>which<>10728 0.9708 1 11718 1495
of<>;<>10729 0.9709 1 6321 2829
to<>.<>10730 0.9736 4 6501 11719
from<>,<>10731 0.9737 1 877 21495
of<>,<>10732 0.9750 7 6321 21495
of<>to<>10732 0.9750 2 6321 6501
,<>!<>10733 0.9758 2 21495 1896
had<>.<>10734 0.9764 1 1856 11719
.<>said<>10735 0.9777 1 11718 1960
.<>?<>10735 0.9777 1 11718 1960
to<>,<>10736 0.9792 6 6501 21495
the<>said<>10736 0.9792 1 12566 1960
,<>me<>10737 0.9830 1 21495 1356
my<>,<>10738 0.9837 1 1412 21495
.<>and<>10739 0.9838 2 11718 5350
,<>him<>10740 0.9843 1 21495 1470
.<>that<>10741 0.9864 1 11718 3225
,<>?<>10742 0.9882 1 21495 1960
.<>of<>10743 0.9931 1 11718 6321
,<>.<>10744 0.9981 1 21495 11719
.<>,<>10744 0.9981 1 11718 21495
The output of tmi.pm is just what I expected. The associations are meaningful and their ranks are almost the same as those in the output of count.pl.
The output of user2.pm closely resembles that of mi.pm. But the ranking surprised me, since the first-ranked bigram is ".<>,<>" and other meaningless pairs rank highly. However, I found that their ranks are also very high in the output of count.pl. So I think this is a flaw of the measure rather than a problem with the program.
So, since tmi.pm seems better at identifying significant and interesting collocations, I think the measure of tmi is better than user2.
*****************************************************************************************************************************
CUTOFF POINT:
*****************************************************************************************************************************
I don't think I can find obvious cutoff points in the lists of bigrams as ranked by NSP for either tmi.pm or user2.pm, because the scores of both lists are distributed continuously rather than discretely. So the score-rank curve should be smooth, which means that the collocations of a large corpus are distributed dispersedly rather than being concentrated at a few points.
*****************************************************************************************************************************
RANK COMPARISON:
*****************************************************************************************************************************
Judged from the top 50 comparison, tmi.pm is more like ll.pm, and user2.pm is more like mi.pm. What I observed is:
perl rank.pl output2.ll output2.tmi
Rank correlation coefficient = 0.1800
perl rank.pl output2.ll output2.user2
Rank correlation coefficient = 0.6996
perl rank.pl output2.mi output2.tmi
Rank correlation coefficient = 0.7158
perl rank.pl output2.mi output2.user2
Rank correlation coefficient = 0.9204
It seems that user2.pm is more like both ll.pm and mi.pm than tmi.pm is.
The rank correlation coefficient between ll.pm and tmi.pm is very low because tmi.pm stops behaving like ll.pm when the rank grows large. I think this is because, though they are similar when p(x,y) is large, when p(x,y) becomes very small, p(x,y)*log(p(x,y)/(p(x)*p(y))) is much smaller than p(x,y).
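The small-p(x,y) behaviour described here can be checked numerically. A tiny Python sketch with illustrative probabilities (my own toy values, assuming a base-2 logarithm):

```python
from math import log2

# illustrative case: the bigram is far rarer than independence predicts
pxy, px, py = 1e-6, 1e-2, 1e-2

# the per-bigram TMI contribution p(x,y)*log2(p(x,y)/(p(x)*p(y)))
tmi_term = pxy * log2(pxy / (px * py))

print(tmi_term < 0)    # True: the term is negative
print(tmi_term < pxy)  # True: far below p(x,y) itself
```

So for very rare bigrams the TMI term drops below p(x,y), which is why the two rankings diverge at large ranks.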
*****************************************************************************************************************************
OVERALL RECOMMENDATION:
*****************************************************************************************************************************
I think tmi.pm is the best for identifying significant collocations in large corpora of text.
The modules mi.pm and user2.pm don't work well when p(x,y) is large, and they both rank some meaningless associations highly. Though tmi.pm and ll.pm perform similarly when p(x,y) is large, when p(x,y) becomes very small, p(x,y)*log(p(x,y)/(p(x)*p(y))) shrinks more quickly than p(x,y), which reflects the distribution of collocations in large corpora, so tmi.pm performs better than ll.pm. Hence tmi.pm is the best for identifying significant collocations in large corpora of text.
*****************************************************************************************************************************
******************
* *
* Experiment 2 *
* *
******************
*****************************************************************************************************************************
TOP 50 RANK:
*****************************************************************************************************************************
The Corpus I use is <>.
Top 50 by ll3.pm
said<>d<>Artagnan<>1 6812109.6494 322 1960 1494 1827 322 322 1466
.<>d<>Artagnan<>2 6812147.0776 61 11718 1494 1827 64 371 1466
d<>Artagnan<>,<>3 6812880.4280 523 1494 1827 21495 1466 531 574
cried<>d<>Artagnan<>4 6813309.2328 91 447 1494 1827 91 91 1466
replied<>d<>Artagnan<>5 6813411.5904 74 378 1494 1827 74 74 1466
d<>Artagnan<>.<>6 6813426.2759 252 1494 1827 11719 1466 258 252
Monsieur<>d<>Artagnan<>7 6813565.4285 56 454 1494 1827 56 56 1466
d<>Artagnan<>;<>8 6813615.2360 95 1494 1827 2829 1466 96 95
the<>d<>Artagnan<>9 6813664.3558 2 12566 1494 1827 3 2 1466
asked<>d<>Artagnan<>10 6813687.7451 28 208 1494 1827 28 28 1466
d<>Artagnan<>the<>11 6813694.4313 8 1494 1827 12566 1466 8 8
d<>Artagnan<>s<>12 6813703.9910 39 1494 1827 783 1466 40 43
d<>Artagnan<>was<>13 6813709.5038 49 1494 1827 2511 1466 50 72
d<>Artagnan<>had<>14 6813714.3382 49 1494 1827 1856 1466 50 62
!<>d<>Artagnan<>15 6813737.2058 2 1896 1494 1827 2 27 1466
,<>d<>Artagnan<>16 6813737.6726 161 21495 1494 1827 161 161 1466
dear<>d<>Artagnan<>17 6813741.0051 19 188 1494 1827 19 19 1466
d<>Artagnan<>took<>18 6813744.0007 8 1494 1827 234 1466 8 19
d<>Artagnan<>of<>19 6813747.2098 2 1494 1827 6321 1466 2 2
murmured<>d<>Artagnan<>20 6813747.3751 13 65 1494 1827 13 13 1466
d<>Artagnan<>related<>21 6813753.6881 2 1494 1827 34 1466 2 9
a<>d<>Artagnan<>22 6813764.7120 1 4561 1494 1827 1 1 1466
demanded<>d<>Artagnan<>23 6813770.4532 8 31 1494 1827 8 8 1466
d<>Artagnan<>remained<>24 6813773.4664 2 1494 1827 97 1466 2 9
d<>Artagnan<>looked<>25 6813773.8246 2 1494 1827 99 1466 2 9
d<>Artagnan<>went<>26 6813773.9592 9 1494 1827 187 1466 9 14
d<>Artagnan<>felt<>27 6813774.6053 7 1494 1827 97 1466 7 11
d<>Artagnan<>blushed<>28 6813776.7168 1 1494 1827 12 1466 1 5
continued<>d<>Artagnan<>29 6813778.1083 12 170 1494 1827 12 12 1466
d<>Artagnan<>a<>30 6813778.9446 4 1494 1827 4561 1466 4 4
of<>d<>Artagnan<>31 6813780.3571 63 6321 1494 1827 63 63 1466
d<>Artagnan<>bowed<>32 6813781.2949 1 1494 1827 41 1466 1 6
d<>Artagnan<>did<>33 6813783.5557 8 1494 1827 365 1466 8 15
to<>d<>Artagnan<>34 6813785.0053 60 6501 1494 1827 60 60 1466
that<>d<>Artagnan<>35 6813785.9951 41 3225 1494 1827 41 41 1466
d<>Artagnan<>drew<>36 6813786.6299 2 1494 1827 80 1466 2 7
?<>d<>Artagnan<>37 6813786.6915 5 1960 1494 1827 5 20 1466
in<>d<>Artagnan<>38 6813788.4593 2 3146 1494 1827 2 2 1466
d<>Artagnan<>threw<>39 6813791.3619 1 1494 1827 45 1466 1 5
which<>d<>Artagnan<>40 6813791.5723 25 1495 1494 1827 25 25 1466
d<>Artagnan<>thought<>41 6813792.0459 3 1494 1827 146 1466 3 8
interrupted<>d<>Artagnan<>42 6813792.7101 5 28 1494 1827 5 5 1466
and<>d<>Artagnan<>43 6813795.5949 45 5350 1494 1827 45 45 1466
d<>Artagnan<>followed<>44 6813798.6235 3 1494 1827 86 1466 3 6
d<>Artagnan<>uttered<>45 6813799.1809 1 1494 1827 42 1466 1 4
d<>Artagnan<>believed<>46 6813799.3480 2 1494 1827 64 1466 2 5
d<>Artagnan<>ran<>47 6813799.3650 1 1494 1827 43 1466 1 4
d<>Artagnan<>could<>48 6813799.6802 7 1494 1827 337 1466 7 11
d<>Artagnan<>to<>49 6813799.7664 31 1494 1827 6501 1466 31 31
d<>Artagnan<>turned<>50 6813799.7933 4 1494 1827 79 1466 4 6
thought<>d<>Artagnan<>51 6813800.1218 7 146 1494 1827 7 7 1466
Top 50 by user3.pm
to<>,<>my<>12471 0.8713 1 6501 21495 1412 6 36 435
.<>She<>,<>12471 0.8713 1 11718 207 21495 176 1751 1
,<>to<>her<>12472 0.8714 1 21495 6501 1491 265 109 122
,<>so<>,<>12473 0.8732 1 21495 633 21495 70 1687 62
,<>you<>,<>12474 0.8756 5 21495 3370 21495 376 1687 285
Artagnan<>,<>of<>12475 0.8793 1 1827 21495 6321 574 7 83
,<>?<>said<>12476 0.8810 1 21495 1960 1960 1 43 180
,<>and<>me<>12477 0.8853 1 21495 5350 1356 3002 89 5
.<>This<>,<>12478 0.8858 1 11718 233 21495 200 1751 1
,<>on<>,<>12479 0.8893 1 21495 952 21495 111 1687 20
d<>Artagnan<>of<>12480 0.8918 2 1494 1827 6321 1466 2 2
was<>,<>the<>12481 0.8934 1 2511 21495 12566 62 105 561
have<>,<>and<>12482 0.8941 1 1460 21495 5350 21 5 3002
at<>,<>and<>12483 0.8966 1 1493 21495 5350 3 5 3002
I<>have<>,<>12484 0.8985 2 3849 1460 21495 424 168 21
,<>I<>and<>12485 0.8996 1 21495 3849 5350 824 72 3
he<>of<>the<>12486 0.8999 1 2689 6321 12566 2 148 1615
?<>I<>,<>12487 0.9001 1 1960 3849 21495 167 298 68
,<>had<>the<>12488 0.9004 1 21495 1856 12566 110 1401 49
to<>be<>.<>12489 0.9008 1 6501 1359 11719 391 463 8
and<>that<>,<>12490 0.9020 1 5350 3225 21495 170 295 56
you<>,<>to<>12491 0.9033 1 3370 21495 6501 285 108 265
of<>to<>the<>12492 0.9054 1 6321 6501 12566 2 55 748
,<>you<>.<>12493 0.9066 2 21495 3370 11719 376 303 226
;<>I<>,<>12494 0.9075 1 2829 3849 21495 211 116 68
I<>,<>to<>12495 0.9086 1 3849 21495 6501 68 175 265
,<>Yes<>,<>12496 0.9091 1 21495 314 21495 1 1687 240
to<>his<>,<>12497 0.9112 1 6501 2910 21495 175 572 5
,<>and<>is<>12498 0.9120 1 21495 5350 1850 3002 310 3
had<>,<>and<>12499 0.9161 1 1856 21495 5350 21 7 3002
and<>to<>the<>12500 0.9175 1 5350 6501 12566 45 268 748
said<>,<>and<>12501 0.9189 1 1960 21495 5350 118 4 3002
I<>,<>and<>12502 0.9193 2 3849 21495 5350 68 13 3002
.<>It<>,<>12503 0.9212 1 11718 362 21495 293 1751 1
a<>d<>Artagnan<>12504 0.9233 1 4561 1494 1827 1 1 1466
,<>to<>you<>12505 0.9286 1 21495 6501 3370 265 373 171
,<>and<>,<>12506 0.9301 10 21495 5350 21495 3002 1687 14
,<>.<>D<>12507 0.9305 1 21495 11718 367 1 2 310
,<>that<>,<>12508 0.9339 2 21495 3225 21495 389 1687 56
to<>,<>said<>12509 0.9417 1 6501 21495 1960 6 3 1246
,<>said<>to<>12510 0.9427 1 21495 1960 6501 1246 295 20
the<>d<>Artagnan<>12511 0.9453 2 12566 1494 1827 3 2 1466
of<>his<>,<>12512 0.9503 1 6321 2910 21495 383 612 5
,<>of<>his<>12513 0.9524 1 21495 6321 2910 83 293 383
his<>,<>and<>12514 0.9534 1 2910 21495 5350 5 104 3002
of<>,<>and<>12515 0.9537 2 6321 21495 5350 7 100 3002
,<>I<>.<>12516 0.9556 1 21495 3849 11719 824 303 8
,<>I<>,<>12517 0.9601 2 21495 3849 21495 824 1687 68
and<>,<>and<>12518 0.9706 1 5350 21495 5350 14 19 3002
,<>he<>,<>12519 0.9730 1 21495 2688 21495 499 1687 190
The top 50 of the output of ll3.pm consists of many trigrams that contain one name from the novel, but all these trigrams are meaningful and they all rank highly in the output of count.pl.
The top 50 of the output of user3.pm consists of many trigrams that don't rank highly in the output of count.pl.
So, since ll3.pm seems better at identifying significant and interesting collocations, I think the measure of ll3 is better than user3.
*****************************************************************************************************************************
CUTOFF POINT:
*****************************************************************************************************************************
Also, I don't think I can find obvious cutoff points in the lists of trigrams as ranked by NSP for either ll3.pm or user3.pm, because the scores of both lists are distributed continuously rather than discretely. So the score-rank curve should be smooth, which means that the collocations of a large corpus are distributed dispersedly rather than being concentrated at a few points.
*****************************************************************************************************************************
OVERALL RECOMMENDATION:
*****************************************************************************************************************************
I think ll3.pm is better for identifying significant collocations in large corpora of text. The trigrams in the output of ll3.pm are all meaningful and rank highly in the output of count.pl. Though many top-ranked trigrams contain one name, I think that is a peculiarity of this particular novel.