Here are some further post-mortem comments on assignment 2, the Great Collocation Hunt.

Tests and measures:
-------------------

I was very interested in the range of measures of association that were employed in your assignments. Below I've listed all the tests and the number of students who selected each. I've tried to group them as Tests versus Measures, although in a couple of cases I'm not clear on what the proper categorization should be. (Please let me know if you see that your test has been misclassified!)

Tests of Association (can be assigned statistical significance):

  T-score (2)
  Mann Whitney U-test (2)
  Bi Predictive Exact Test (aka Freeman Halton Test) (1)
  Log Linear Model of n-way Association (1)
  JT Test (1)
  Poisson Distribution (1)
  Kolmogorov Smirnov Test (1)

Measures of Association (provide a descriptive score):

  Odds Ratio (2)
  Odds Ratio with smoothing (1)
  Log Odds Ratio (1)
  Goodman Kruskal Lambda (aka Symmetric Lambda) (1)
  Goodman Kruskal Gamma (1)
  phi Coefficient (1)
  Yule's Coefficient of Colligation (2)
  Mutual Expectation (1)
  Russell Rao Coefficient (1)

--------------------------

Odds Ratio:
-----------

There are many things to say about each of these tests. However, here are a few thoughts on the Odds Ratio, which has the virtue of being fairly easy to formulate and has a nice intuitive motivation. As a result it seemed to generate quite a bit of discussion.

The odds ratio is a good example of a measure that does not rely on expected values (relative to a hypothesis of independence). To refresh our recollection, the odds ratio is defined as

  OR = (n11*n22)/(n12*n21)

where the nij are the cell counts of the 2x2 contingency table:

  n11 n12
  n21 n22

Note that the odds ratio can give a high score both to bigrams whose words always occur together and to bigrams whose words never occur together. For example, the following two tables have identical odds ratios:
  10 30          20 30
  40 20          40 10

The odds ratio is telling us that the words in this bigram occur together, or not at all, a total of (10*20 == 20*10) times. So the odds ratio doesn't make a distinction between occurring together and never occurring together. Now of course for many bigram tables the value of n22 is very large and will tend to dominate the value of this score, and that might be a limitation of the approach.

-----------------------------------------

Log Likelihood Ratio versus True Mutual Information:
----------------------------------------------------

The log likelihood ratio and true mutual information are essentially the same measure, and only differ by a scaling factor. As such their rankings should be the same. However, many of you concluded (based on the Spearman correlation coefficient as computed by rank.pl) that they are quite different. I think that can be attributed to two possibilities.

1) You gave positive values to TMI and negative scores to LL (both should in fact be positive). In that case the rankings produced by each should be exactly opposite, and get a correlation coefficient of -1.

2) You did not use enough precision for TMI values. The default precision for NSP is 4 digits. This is not nearly enough for TMI, which generates very small values. If you did not reset the precision to 8 or more digits you would end up with a large number of bigrams assigned to the last rank with a score of .0000.

------------

Useful NSP options:
-------------------

There are a variety of options you could have used in NSP to tinker with your results. A combination of frequency cutoffs and stop lists is a typical way to "improve" results, in that you can eliminate low frequency and/or non-content words.

--precision N

  Resets the precision of statistic.pl to N digits. The default is 4, which is not enough for TMI (see the discussion above).

--set_freq_combo FILE

  Specifies which combinations need to be counted by count.pl when dealing with n-grams where n > 2.
  In the case of this assignment, for trigrams you only needed to count the individual occurrences of words and the three-way count of them all occurring together. You did not need to keep track of the various pairwise combinations. This was due to our assumption that the hypothesis of independence was

    p(X,Y,Z) = p(X)*p(Y)*p(Z)

  Not keeping track of the pairwise combinations could be a significant help to performance.

--stop FILE

  Tells count.pl to ignore all the ngrams made up entirely of words in this file. A good way to get rid of words like "the" and bigrams like "for the". These ngrams are not counted and are removed from the sample.

--frequency N

  Tells count.pl/statistic.pl not to display words that occur less than N times. However, the words that are not displayed are still counted and included in the sample size.

--remove N

  Tells count.pl not to count words that occur less than N times. These words are not counted and not included in the sample size. This can have a radical effect on the outcome of the tests, since the elimination of low frequency words results in large changes in the sample size.

Ted Pedersen
November 1, 2002
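P.S. The claim in the odds ratio section above, that the two example tables score identically, is easy to verify with a short script. This is a minimal sketch; the odds_ratio helper is my own illustration, not part of NSP:

```python
# Minimal sketch: odds ratio for a 2x2 bigram contingency table laid out as
#   n11 n12
#   n21 n22
def odds_ratio(n11, n12, n21, n22):
    """OR = (n11*n22)/(n12*n21); assumes n12 and n21 are nonzero."""
    return (n11 * n22) / (n12 * n21)

# The two example tables from the odds ratio section:
or1 = odds_ratio(10, 30, 40, 20)  # words co-occur 10 times (n11 = 10)
or2 = odds_ratio(20, 30, 40, 10)  # words co-occur 20 times (n11 = 20)

# Both reduce to 200/1200, so the measure cannot tell
# "occurring together" apart from "never occurring together".
print(or1, or2, or1 == or2)
```

Swapping n11 and n22 leaves the numerator n11*n22 unchanged, which is exactly why the two tables are indistinguishable to this measure.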