Here are some further post-mortem comments on assignment 2, the Great
Collocation Hunt.
Tests and Measures:
-------------------
I was very interested in the range of measures of association that
were employed in your assignments. Below I've listed all the tests and
the number of students who selected each. I've tried to group them as
Tests versus Measures, although in a couple of cases I'm not clear on
what the proper categorization should be. (Please let me know if you see
that your test has been misclassified!)
Tests of Association (can be assigned statistical significance)
T-score (2)
Mann-Whitney U-test (2)
Bi-Predictive Exact Test (aka Freeman-Halton Test) (1)
Log Linear Model of n-way Association (1)
JT Test (1)
Poisson Distribution (1)
Kolmogorov-Smirnov Test (1)
Measures of Association (provide a descriptive score):
Odds Ratio (2)
Odds Ratio with smoothing (1)
Log Odds Ratio (1)
Goodman-Kruskal Lambda (aka Symmetric Lambda) (1)
Goodman-Kruskal Gamma (1)
Phi Coefficient (1)
Yule's Coefficient of Colligation (2)
Mutual Expectation (1)
Russell-Rao Coefficient (1)
--------------------------
Odds Ratio:
-----------
There are many things to say about each of these tests.
However, here are a few thoughts on the Odds Ratio, which
has the virtue of being fairly easy to formulate and has
a nice intuitive motivation. As a result it seemed to generate
quite a bit of discussion.
The odds ratio is a good example of a measure that does not rely
on expected values (relative to a hypothesis of independence).
To refresh our recollection of the odds ratio, it is defined over
the 2x2 table

    n11   n12
    n21   n22

where the nij are our cell counts, as

    OR = (n11*n22)/(n12*n21)
Note that the odds ratio can give a high score to those
bigrams that always occur together or to those that never
occur together. For example, the following two
tables have identical odds ratios...
10 30
40 20
20 30
40 10
The odds ratio depends only on the products of the diagonal and
off-diagonal cells, and those are the same in both tables
(10*20 == 20*10, and 30*40 in each). So the odds ratio doesn't
make a distinction between occurring together and never
occurring together.
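A small sketch (in Python, not NSP code) confirming that the two
tables above receive identical odds ratios:

```python
# Odds ratio of a 2x2 contingency table with cells
#   n11 n12
#   n21 n22
# as defined above: OR = (n11*n22)/(n12*n21).

def odds_ratio(n11, n12, n21, n22):
    return (n11 * n22) / (n12 * n21)

# The two tables from the text:
a = odds_ratio(10, 30, 40, 20)   # (10*20)/(30*40) = 200/1200
b = odds_ratio(20, 30, 40, 10)   # (20*10)/(30*40) = 200/1200
print(a, b)  # both 1/6, about 0.1667
```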
Now of course for many bigram tables the value of n22 is very
large and will tend to dominate the value of this score, and
that might be a limitation of the approach.
-----------------------------------------
Log Likelihood Ratio versus True Mutual Information:
-----------------------------------------------------
The log likelihood ratio and true mutual information are
essentially the same measure, and only differ by a scaling
factor. As such their rankings should be the same. However,
many of you concluded (based on the Spearman correlation
coefficient as computed by rank.pl) that they are quite
different. I think that can be attributed to one of two causes.
1) You may have given positive values to TMI and negative scores to
LL (both should in fact be positive). In that case the rankings
produced by each will be exactly opposite, yielding a correlation
coefficient of -1.
2) Not enough precision for TMI values. The default precision
for NSP is 4 digits. This is not nearly enough for TMI, which
generates very small values. If you did not reset the precision to
8 or more digits, you would end up with a large number of bigrams
assigned to the last rank with a score of 0.0000.
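A minimal sketch (illustrative Python, not NSP code) of why the two
measures must rank bigrams identically: for a 2x2 table, LL and TMI
sum the same observed-over-expected terms, so LL = 2*N*ln(2)*TMI
(with TMI measured in bits).

```python
import math

def ll_and_tmi(n11, n12, n21, n22):
    """Log-likelihood ratio (G^2) and true mutual information (bits)
    for a 2x2 contingency table."""
    n = [[n11, n12], [n21, n22]]
    N = n11 + n12 + n21 + n22
    row = [n11 + n12, n21 + n22]
    col = [n11 + n21, n12 + n22]
    ll = 0.0   # 2 * sum of obs * ln(obs/exp)
    tmi = 0.0  # sum of (obs/N) * log2(obs/exp); note p_ij/(p_i*p_j) = obs/exp
    for i in range(2):
        for j in range(2):
            obs = n[i][j]
            exp = row[i] * col[j] / N
            if obs > 0:
                ll += 2 * obs * math.log(obs / exp)
                tmi += (obs / N) * math.log2(obs / exp)
    return ll, tmi, N

ll, tmi, N = ll_and_tmi(10, 30, 40, 20)
# The two scores differ only by a constant scaling factor,
# so any ranking by one is a ranking by the other.
assert abs(ll - 2 * N * math.log(2) * tmi) < 1e-9
```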
------------
Useful NSP options:
-------------------
There are a variety of options you could have used in NSP to tinker with
your results. A combination of frequency cutoffs and stop lists is a
typical way to "improve" results, in that you can eliminate low frequency
and/or non-content words.
--precision N
Reset precision of statistic.pl to N digits of precision. The default
is 4, which is not enough for TMI (see discussion above).
--set_freq_combo FILE
specify which combinations need to be counted by count.pl when dealing
with n-grams where n > 2. In the case of this assignment for trigrams, you
only needed to count the individual occurrences of words and the
three-way count of them all occurring together. You did not need to keep
track of the various pairwise combinations. This was due to our assumption
that the hypothesis of independence was p(X,Y,Z) = p(X)*p(Y)*p(Z).
Not keeping track of the pairwise combinations can be a significant
help to performance.
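Under that independence hypothesis, the expected count of a trigram
can be sketched as follows (illustrative Python, not NSP code) --
note that only the three unigram counts and the sample size N are
needed, with no pairwise counts anywhere:

```python
# Expected trigram count under H0: p(X,Y,Z) = p(X)*p(Y)*p(Z),
# using maximum-likelihood estimates p(w) = n_w / N.

def expected_trigram_count(n_x, n_y, n_z, N):
    return N * (n_x / N) * (n_y / N) * (n_z / N)

# Hypothetical unigram counts in a sample of N = 10000 trigrams:
print(expected_trigram_count(100, 50, 20, 10000))  # 0.001
```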
--stop FILE
Tells count.pl to ignore all the ngrams made up entirely of words in this
file. A good way to get rid of words like "the" and bigrams like "for
the". These ngrams are not counted and are removed from the sample.
--frequency N
Tells count.pl/statistic.pl not to display words that occur less than N
times. However, the words that are not displayed are still counted and
included in the sample size.
--remove N
Tells count.pl not to count words that occur less than N times. These
words are not counted and not included in the sample size. This can have a
radical effect on the outcome of the tests (since the elimination of low
frequency words will result in large changes in the sample size).
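The contrast between --frequency (hide from the display, keep in the
sample size) and --remove (drop from the counts entirely) can be
sketched with a toy counter. This is illustrative Python, not NSP's
actual Perl internals, and the example bigrams are made up:

```python
from collections import Counter

counts = Counter({"the cat": 5, "cat sat": 1, "sat on": 1})
N = sum(counts.values())  # sample size = 7

# --frequency 2: low-count bigrams are merely not *displayed*;
# the sample size N is unchanged.
displayed = {g: c for g, c in counts.items() if c >= 2}
print(displayed, "N =", N)

# --remove 2: low-count bigrams are dropped from the counts,
# so the sample size shrinks -- which changes every test's result.
removed = Counter({g: c for g, c in counts.items() if c >= 2})
N_removed = sum(removed.values())
print(dict(removed), "N =", N_removed)
```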
Ted Pedersen
November 1, 2002