Here are some further post-mortem comments on assignment 2, the Great
Collocation Hunt.
Tests and Measures:
-------------------
I was very interested in the range of measures of association that
were employed in your assignments. Below I've listed all the tests and
the number of students who selected each. I've tried to group them as
Tests versus Measures, although in a couple of cases I'm not clear on
what the proper categorization should be. (Please let me know if you see
that your test has been misclassified!)
Tests of Association (can be assigned statistical significance)
T-score (2)
Mann-Whitney U-test (2)
Bi-Predictive Exact Test (aka Freeman-Halton Test) (1)
Log Linear Model of n-way Association (1)
JT Test (1)
Poisson Distribution (1)
Kolmogorov-Smirnov Test (1)
Measures of Association (provide a descriptive score):
Odds Ratio (2)
Odds Ratio with smoothing (1)
Log Odds Ratio (1)
Goodman-Kruskal Lambda (aka Symmetric Lambda) (1)
Goodman-Kruskal Gamma (1)
Phi Coefficient (1)
Yule's Coefficient of Colligation (2)
Mutual Expectation (1)
Russell-Rao Coefficient (1)
--------------------------
Odds Ratio:
-----------
There are many things to say about each of these tests.
However, here are a few thoughts on the Odds Ratio, which
has the virtue of being fairly easy to formulate and has
a nice intuitive motivation. As a result it seemed to generate
quite a bit of discussion.
The odds ratio is a good example of a measure that does not rely
on expected values (relative to a hypothesis of independence).
To refresh our recollection of the odds ratio, it is defined over
the 2x2 table

    n11   n12
    n21   n22

where the nij are our cell counts, as

    OR = (n11*n22)/(n12*n21)
Note that the odds ratio can give a high score to those
bigrams that always occur together or to those that never
occur together. For example, the following two
tables have identical odds ratios...
10 30
40 20
20 30
40 10
The odds ratio depends only on the products of the diagonal and
off-diagonal cells, and those are the same in both tables
(10*20 == 20*10, and 30*40 in each). So the odds ratio doesn't
make a distinction between occurring together and never
occurring together.
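A small sketch (in Python, not NSP code) confirming that the two
tables above receive identical odds ratios:

```python
# Odds ratio of a 2x2 contingency table with cells
#   n11 n12
#   n21 n22
# as defined above: OR = (n11*n22)/(n12*n21).

def odds_ratio(n11, n12, n21, n22):
    return (n11 * n22) / (n12 * n21)

# The two tables from the text:
a = odds_ratio(10, 30, 40, 20)   # (10*20)/(30*40) = 200/1200
b = odds_ratio(20, 30, 40, 10)   # (20*10)/(30*40) = 200/1200
print(a, b)  # both 1/6, about 0.1667
```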
Now of course for many bigram tables the value of n22 is very
large and will tend to dominate the value of this score, and
that might be a limitation of the approach.
-----------------------------------------
Log Likelihood Ratio versus True Mutual Information:
-----------------------------------------------------
The log likelihood ratio and true mutual information are
essentially the same measure, and only differ by a scaling
factor. As such their rankings should be the same. However,
many of you concluded (based on the Spearman correlation
coefficient as computed by rank.pl) that they are quite
different. I think that can be attributed to one of two causes.
1) You may have given positive values to TMI and negative scores to
LL (both should in fact be positive). In that case the rankings
produced by each will be exactly opposite, yielding a correlation
coefficient of -1.
2) Not enough precision for TMI values. The default precision
for NSP is 4 digits. This is not nearly enough for TMI, which
generates very small values. If you did not reset the precision to
8 or more digits, you would end up with a large number of bigrams
assigned to the last rank with a score of 0.0000.
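A minimal sketch (illustrative Python, not NSP code) of why the two
measures must rank bigrams identically: for a 2x2 table, LL and TMI
sum the same observed-over-expected terms, so LL = 2*N*ln(2)*TMI
(with TMI measured in bits).

```python
import math

def ll_and_tmi(n11, n12, n21, n22):
    """Log-likelihood ratio (G^2) and true mutual information (bits)
    for a 2x2 contingency table."""
    n = [[n11, n12], [n21, n22]]
    N = n11 + n12 + n21 + n22
    row = [n11 + n12, n21 + n22]
    col = [n11 + n21, n12 + n22]
    ll = 0.0   # 2 * sum of obs * ln(obs/exp)
    tmi = 0.0  # sum of (obs/N) * log2(obs/exp); note p_ij/(p_i*p_j) = obs/exp
    for i in range(2):
        for j in range(2):
            obs = n[i][j]
            exp = row[i] * col[j] / N
            if obs > 0:
                ll += 2 * obs * math.log(obs / exp)
                tmi += (obs / N) * math.log2(obs / exp)
    return ll, tmi, N

ll, tmi, N = ll_and_tmi(10, 30, 40, 20)
# The two scores differ only by a constant scaling factor,
# so any ranking by one is a ranking by the other.
assert abs(ll - 2 * N * math.log(2) * tmi) < 1e-9
```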
------------
Useful NSP options:
-------------------
There are a variety of options you could have used in NSP to tinker with
your results. A combination of frequency cutoffs and stop lists is a
typical way to "improve" results, in that you can eliminate low frequency
and/or non-content words.
--precision N
Reset precision of statistic.pl to N digits of precision. The default
is 4, which is not enough for TMI (see discussion above).
--set_freq_combo FILE
specify which combinations need to be counted by count.pl when dealing
with n-grams where n > 2. In the case of this assignment for trigrams, you
only needed to count the individual occurrences of words and the
three-way count of them all occurring together. You did not need to keep
track of the various pairwise combinations. This was due to our assumption
that the hypothesis of independence was p(X,Y,Z) = p(X)*p(Y)*p(Z).
Not keeping track of the pairwise combinations can be a significant
help to performance.
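Under that independence hypothesis, the expected count of a trigram
can be sketched as follows (illustrative Python, not NSP code) --
note that only the three unigram counts and the sample size N are
needed, with no pairwise counts anywhere:

```python
# Expected trigram count under H0: p(X,Y,Z) = p(X)*p(Y)*p(Z),
# using maximum-likelihood estimates p(w) = n_w / N.

def expected_trigram_count(n_x, n_y, n_z, N):
    return N * (n_x / N) * (n_y / N) * (n_z / N)

# Hypothetical unigram counts in a sample of N = 10000 trigrams:
print(expected_trigram_count(100, 50, 20, 10000))  # 0.001
```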
--stop FILE
Tells count.pl to ignore all the ngrams made up entirely of words in this
file. A good way to get rid of words like "the" and bigrams like "for
the". These ngrams are not counted and are removed from the sample.
--frequency N
Tells count.pl/statistic.pl not to display words that occur less than N
times. However, the words that are not displayed are still counted and
included in the sample size.
--remove N
Tells count.pl not to count words that occur less than N times. These
words are not counted and not included in the sample size. This can have a
radical effect on the outcome of the tests (since the elimination of low
frequency words will result in large changes in the sample size).
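The contrast between --frequency (hide from the display, keep in the
sample size) and --remove (drop from the counts entirely) can be
sketched with a toy counter. This is illustrative Python, not NSP's
actual Perl internals, and the example bigrams are made up:

```python
from collections import Counter

counts = Counter({"the cat": 5, "cat sat": 1, "sat on": 1})
N = sum(counts.values())  # sample size = 7

# --frequency 2: low-count bigrams are merely not *displayed*;
# the sample size N is unchanged.
displayed = {g: c for g, c in counts.items() if c >= 2}
print(displayed, "N =", N)

# --remove 2: low-count bigrams are dropped from the counts,
# so the sample size shrinks -- which changes every test's result.
removed = Counter({g: c for g, c in counts.items() if c >= 2})
N_removed = sum(removed.values())
print(dict(removed), "N =", N_removed)
```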
Ted Pedersen
November 1, 2002