K-vec++: Approach For Finding Word Correspondences
====================================================

Version 0.2

Copyright (C) 2001-2002
Ted Pedersen, tpederse@d.umn.edu
Nitin Varma, varm0003@d.umn.edu
University Of Minnesota, Duluth

1. Introduction
----------------

This package is an implementation of the part of the K-vec algorithm [1]
that finds word correspondences. Word correspondences are word pairs or
phrases that are exact translations of each other from one language to
another. The algorithm finds these word pairs in a parallel text, which
is a document in one language along with its translation in another
language. This package also includes a program to combine the output of
different techniques and yield results better than any of the individual
methods. Also included is a program to evaluate the quality of the
lexicon created.

Let the English text "Canadian Institute of Health Research Act." and
the French text "Loi sur les Instituts de recherche en sante du Canada"
form an English-French parallel text. The word "health" from the English
text and the word "sante" from the French text form a word
correspondence.

This README describes the general characteristics of the algorithm and
provides a detailed description of it with the help of examples. It also
provides a detailed description of the other programs in this package.

2. General characteristics of the K-vec Algorithm
-------------------------------------------------

1. The K-vec algorithm is based on the observation that if two words are
   translations of each other, then they occur almost an equal number of
   times and in approximately the same regions of the parallel text.

2. It does not require any prior knowledge of the punctuation or the
   sentence boundaries of the languages under consideration. Hence, it
   is not tied to a particular language pair and may work for any
   language pair where the words have been segmented.

3. K-vec deals only with word-to-word translations and cannot find
   phrasal translations.
   Phrasal translations are cases where a word or group of words in one
   language is translated to a word or group of words in the other
   language.

3. Background concepts
------------------------

In this section we describe some of the terms we will be using in the
description of the K-vec algorithm.

3.1 Two by two contingency tables
---------------------------------

Contingency tables are a simple way of representing data. Consider an
example where you want to categorize students depending on whether or
not they have registered for physics and chemistry courses. Suppose the
sample under consideration consists of 10 students. Let the number of
students registered for both physics and chemistry, only for physics,
and only for chemistry be 3, 2, and 1 respectively. This data can be
represented in a two by two matrix, called a contingency table, as
follows.

                                 Y
                  +------------+------------+-----------+
                  |  physics   |  !physics  |  totals   |
   +--------------+------------+------------+-----------+
 X | chemistry    |    a=3     |    c=1     |  a+c=4    |
   | !chemistry   |    b=2     |    d=4     |  b+d=6    |
   +--------------+------------+------------+-----------+
   | totals       |   a+b=5    |   c+d=5    | T=a+b+c+d |
   +--------------+------------+------------+-----------+

Cell 'a' represents the number of students registered for both courses.
Cells 'b' and 'c' represent the number of students registered only for
physics and only for chemistry respectively. Representing data in this
form helps to extract additional information. The first row represents
the total number of students who have registered for the chemistry
course, and the second row represents the total number of students who
have not. Similarly, the first column represents the total number of
students who have registered for the physics course, and the second
column represents the total number of students who have not.
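As a minimal sketch in Python (the package itself is written in Perl),
the cells and marginal totals of this table relate as follows; the
counts 3, 2, 1 and the sample size of 10 are taken straight from the
example above:

```python
# Contingency table cells for the student example in the text:
# a = both courses, b = only physics, c = only chemistry.
a, b, c = 3, 2, 1
T = 10                    # total number of students
d = T - (a + b + c)       # registered for neither course

# Marginal totals, as read off the rows and columns of the table.
physics_total   = a + b   # first column: all physics students
chemistry_total = a + c   # first row:    all chemistry students

print(physics_total, chemistry_total, d)  # -> 5 4 4
```

The same additive relationships are what K-vec exploits later, when the
cells count pieces of text rather than students.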
3.2 Tests of association
--------------------------

Tests of association are statistical tests for measuring the similarity
between two entities. Once a contingency table is created, several tests
of association such as Pointwise Mutual Information, the T-score, the
Log-likelihood ratio, the Dice coefficient, the Odds ratio, the
right-sided Fisher's test, etc. can be used to decide whether or not the
two variables in the contingency table are related, depending on the
scores produced by these tests. Fung and Church use the T-score and
Pointwise Mutual Information as their measures, which are explained
below.

3.2.1 Pointwise Mutual Information
----------------------------------

Pointwise Mutual Information is the logarithm of the ratio of the
probability of the two events X and Y occurring together to the product
of the probabilities of the two events occurring independently. This is
a direct comparison of what is observed to what would be expected if the
two events were independent. If these probabilities are the same then
the two events are independent (and the PMI score will be 0.0).

PMI = log2 ( prob(X=chemistry, Y=physics) /
             ( prob(X=chemistry) prob(Y=physics) ) )

If PMI = 0, the two variables are independent. If PMI > 0, the two
variables are related: the more the observed probability exceeds the
product of the two unconditional probabilities, the stronger the
evidence for dependence. Since there is no exact point at which we cross
the line from independence to dependence, the interpretation of this
test, and of most of the others, is something of an art.

3.2.2 T-score
-------------

The T-score is defined as the ratio of the difference between the
observed and the expected mean to the standard deviation of the sample.
Note that this is a variant of the standard t-test that was proposed for
use in the identification of collocations in large samples of text.
t-score = prob(X=chemistry, Y=physics) - ( prob(X=chemistry) prob(Y=physics) )
          --------------------------------------------------------------------
                     sqrt( (1/T) prob(X=chemistry, Y=physics) )

4. The K-vec Algorithm
----------------------

1. The K-vec algorithm divides the two texts into k pieces each. A piece
   then consists of a certain number of words from the text.

2. For each word in both texts it checks whether the word occurs in a
   piece, and represents the word's distribution in the form of a
   k-dimensional binary vector (k is the number of pieces). The
   occurrence of a word in a particular piece is indicated by a value of
   '1' in the binary vector. A value of '0' indicates that there was not
   a single occurrence of that word in the piece. The frequency of a
   word within a particular piece is not considered.

   Consider an example where an English-French parallel text is divided
   into 10 pieces. If an English word "king" occurs 5 times in piece 2
   and 7 times in piece 8, then the binary vector for the word "king"
   will be

   V(king) = <0, 1, 0, 0, 0, 0, 0, 1, 0, 0>

3. It forms a two by two contingency table for each word pair, where the
   two words come from the texts of the two languages under
   consideration. For example, consider the word pair "king"->"roi",
   where "king" is a word from the English text and "roi" is a word from
   the French text.

                                Y
                 +------------+------------+-----------+
                 |    roi     |    !roi    |  totals   |
   +-------------+------------+------------+-----------+
 X | king        |     a      |     c      |   a+c     |
   | !king       |     b      |     d      |   b+d     |
   +-------------+------------+------------+-----------+
   | totals      |    a+b     |    c+d     | T=a+b+c+d |
   +-------------+------------+------------+-----------+

   Cell 'a' is filled by counting the total number of times a 1 occurs
   in corresponding positions of the vectors of the two words under
   consideration. It represents the number of times a piece contains the
   word "king" and the corresponding piece contains the word "roi".
   Cell 'a+c' is filled by counting the total number of times a 1 occurs
   in the vector for the word "king". It represents the total number of
   pieces containing the word "king". Cell 'c' is filled by subtracting
   the value in cell 'a' from the value in cell 'a+c'. It represents the
   number of times a piece contains the word "king" and the
   corresponding piece does not contain the word "roi".

   Cell 'a+b' is filled by counting the number of times a 1 occurs in
   the vector for the word "roi". It represents the total number of
   pieces containing the word "roi". Cell 'b' is filled by subtracting
   the value in cell 'a' from the value in cell 'a+b'. It represents the
   number of times a piece contains the word "roi" and the corresponding
   piece does not contain the word "king".

   Cells 'b+d' and 'c+d' are filled by counting the total number of
   times a 0 occurs in the vectors for the words "king" and "roi"
   respectively. 'b+d' represents the total number of pieces which do
   not contain the word "king". 'c+d' represents the total number of
   pieces which do not contain the word "roi". Cell 'd' is filled by
   subtracting the value in cell 'b' from the value in cell 'b+d'. It
   represents the number of times a piece does not contain the word
   "king" and the corresponding piece does not contain the word "roi".

   Cell 'a+b+c+d' is filled by adding the values of cells 'a', 'b', 'c'
   and 'd'. It represents the total number of pieces into which each of
   the texts is divided.

4. It then uses a test of association to find how closely related the
   two words are. K-vec uses Pointwise Mutual Information (PMI) and the
   T-score as tests of association. According to Fung and Church, PMI is
   not a reliable test of association when counts are small. Hence, they
   prefer the T-score over PMI.
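The four steps above can be sketched in Python (the package itself is
written in Perl). For concreteness the vectors below are the
fisheries/peches occurrence vectors from the worked example in section
4.3; everything else follows mechanically from the definitions just
given:

```python
import math

# Step 2: k-dimensional binary occurrence vectors, one bit per piece
# (here k = 10; both words occur in pieces 2 and 8).
v_eng = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]   # "fisheries"
v_fre = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]   # "peches"
T = len(v_eng)                            # total number of pieces

# Step 3: fill the two by two contingency table from the vectors.
a = sum(e & f for e, f in zip(v_eng, v_fre))  # 1 in both vectors
a_plus_c = sum(v_eng)        # pieces containing the English word
a_plus_b = sum(v_fre)        # pieces containing the French word
b = a_plus_b - a
c = a_plus_c - a
d = T - (a + b + c)

# Step 4: tests of association from observed/expected probabilities.
p_xy = a / T                 # prob(X, Y)
p_x  = a_plus_c / T          # prob(X)
p_y  = a_plus_b / T          # prob(Y)

pmi = math.log2(p_xy / (p_x * p_y))
tscore = (p_xy - p_x * p_y) / math.sqrt(p_xy / T)

print(round(pmi, 4), round(tscore, 4))  # -> 2.3219 1.1314
```

With identical vectors the table degenerates to b = c = 0, so both
measures report a strong association, as expected for a translation
pair.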
4.1 Number of pieces
--------------------

If the number of pieces into which the text is divided is very large,
then the total number of words in each piece is small, and a word and
its translation may not occur in corresponding pieces. K-vec may miss
such translations, as it looks for word translations in corresponding
pieces. If the number of pieces is very small, then the number of words
in each piece will be large, and the basic advantage of dividing the
text into pieces and looking for a word and its translation in
corresponding pieces is lost.

Fung and Church suggest that K-vec divide the text into a number of
pieces equal to the square root of the total number of word tokens in
the text. For very large data sets the number of word tokens in each
piece will therefore also be large. We thus believe that one does not
always get the best results by dividing the text into a number of pieces
equal to the square root of the total number of words in the text, and
we evaluate this in our experimental work.

4.2 Determining candidate translations
--------------------------------------

Fung and Church do not consider all the word pairs from the two texts as
possible or candidate translations, because there would be too many such
word pairs. They restrict the algorithm to word pairs with frequencies
between 3 and 11. They do not consider low frequency word pairs because
the amount of information about these words is not sufficient to find a
translation.

The words that make up high frequency word pairs in both languages of
the parallel text will occur in almost every piece. K-vec looks for word
correspondences in corresponding pieces, and therefore every high
frequency word in one language will be considered a translation of the
high frequency words in the other language. For example, if there are n
high frequency words in one language and m high frequency words in the
other language, then there will be almost nm word pairs formed using
these words and only n of these can be correct.
Measures of association will be unable to differentiate such words, and
so they are not considered for translation. While the idea of a
frequency cutoff is sound, we do not believe that an upper limit of 11
(used by Fung and Church) is always the correct choice; rather, it
should depend on the number of pieces. For a more detailed description
of the K-vec algorithm refer to [1].

4.3 Example for the K-vec Algorithm
------------------------------------

The following is an example from the K-vec paper [1] that shows how the
algorithm works. Consider a sample of English and French text, with the
English text having 1000 words and the French text having 900 words. It
was found that the word "fisheries" occurs 19 times in the English text
and the word "peches" occurs 21 times in the French text.

1. Consider the English and the French texts to be divided into 10
   pieces, such that each piece of the English text contains 100 words
   while each piece of the French text contains 90 words.

2. Out of the 19 occurrences of the word "fisheries" in the English
   text, it was found that it occurs 3 times in piece 2 and 16 times in
   piece 8. So the binary vector for the word "fisheries" is

   V(fisheries) = < 0, 1, 0, 0, 0, 0, 0, 1, 0, 0 >

   Out of the 21 occurrences of the word "peches" in the French text, it
   was found that it occurs 4 times in piece 2 and 17 times in piece 8.
   So the binary vector for the word "peches" is

   V(peches) = < 0, 1, 0, 0, 0, 0, 0, 1, 0, 0 >

3. Create a two by two contingency table for fisheries->peches.

                                 Y
                  +------------+------------+--------+
                  |   peches   |  !peches   | totals |
   +--------------+------------+------------+--------+
 X | fisheries    |    a=2     |    c=0     |   2    |
   | !fisheries   |    b=0     |    d=8     |   8    |
   +--------------+------------+------------+--------+
   | totals       |     2      |     8      |  T=10  |
   +--------------+------------+------------+--------+

4. Use different measures of association to find whether or not the
   words "fisheries" and "peches" form a good translation pair.
PMI = log2 ( prob(fisheries, peches) / ( prob(fisheries) prob(peches) ) )
    = log2 ( (0.2) / ( (0.2) (0.2) ) )
    = 2.3219

The PMI value of 2.3219 suggests that the words "fisheries" and "peches"
are translations of each other.

t-score = ( prob(fisheries, peches) - ( prob(fisheries) prob(peches) ) )
          ---------------------------------------------------------------
                     sqrt( (1/T) prob(fisheries, peches) )

        =   (0.2) - (0.2)(0.2)
          -----------------------
           sqrt( (1/10)(0.2) )

        = 1.1314

5. Program kvec.pl
------------------

kvec.pl is the Perl implementation of the K-vec algorithm.

Usage: perl kvec.pl [{[--pieces K --lcutoff N --ucutoff M] TARGET SOURCE | --help | --version}]

SOURCE is the text in one language. TARGET is the text in the other
language. This program takes as input two files which are translations
of each other in two different languages and outputs...

1. N, i.e. the total number of pieces.

2. For each word pair XY (where X is a word from SOURCE and Y is a word
   from TARGET), it outputs

   'a'   - The total number of times a piece of the SOURCE text contains
           the word X and the corresponding piece of the TARGET text
           contains the word Y.
   'a+b' - The total number of pieces containing the word X.
   'a+c' - The total number of pieces containing the word Y.

The format of the OUTPUT is as follows

N
word from SOURCE<>word from TARGET<> a a+b a+c
word from SOURCE<>word from TARGET<> a a+b a+c
word from SOURCE<>word from TARGET<> a a+b a+c
.
.
.
word from SOURCE<>word from TARGET<> a a+b a+c

kvec.pl also has the following command line options...

a. --lcutoff N - Allows the user to specify an integer value N, such
   that only the word pairs formed using words occurring N or more times
   in corresponding pieces are considered valid word pairs and are
   present in the output of K-vec. For example, if the user specifies
   the cutoff to be 7, all word pairs with frequency less than 7 will
   not be considered valid word pairs. By default, the lcutoff frequency
   is 4.

b.
--ucutoff M - Allows the user to specify an integer value M, such that
   only the word pairs formed using words occurring M or fewer times in
   corresponding pieces are considered valid pairs and are present in
   the output of K-vec. For example, if the user specifies the cutoff to
   be 10, all words with frequency more than 10 will not be considered
   valid word pairs. By default, all words above the lcutoff are
   considered.

c. --pieces K - Allows the user to specify how many pieces the source
   text and the target text should be divided into. By default, the
   number of pieces into which both the source text and the target text
   are divided is set to the square root of the total number of words in
   the source text.

d. --version - Displays the version number and copyright information.

e. --help - Displays a description of the various command-line options.

6. Program "statistic.pl"
-------------------------

statistic.pl is a Perl program from the Bigram Statistics Package (BSP)
by Ted Pedersen and Satanjeev Banerjee. statistic.pl can be used to
apply different tests of association and find relations between
different entities, which in our case are the words from different
languages. We use tests/measures such as Pointwise Mutual Information,
the Dice coefficient, the Log-likelihood ratio, etc. implemented in BSP.
Also, we use tests/measures such as the Odds ratio, the T-score and the
right-sided Fisher's test implemented as part of this package.

We use statistic.pl to apply these tests to the output of kvec.pl to
find the best word correspondences. statistic.pl ranks the word
correspondences depending on their value (calculated using the chosen
statistic). The higher a pair is ranked (rank 1 being the best), the
better the word correspondence.

statistic.pl has the following command-line options:

a. --frequency N - Allows the user to specify an integer value N, such
   that the statistical tests will be applied to a word pair XY only if
   the words occur together in at least N corresponding pieces.
   For example, if N is 5, then the statistical test will be applied
   only if the word X and the word Y occur together in 5 or more
   corresponding pieces.

b. --rank N - Allows the user to specify an integer value N, such that
   all word correspondences with a rank higher than N will not be
   considered good word correspondences and will not be present in the
   output.

c. --score N - Allows the user to specify a value N, such that all word
   correspondences with a score less than N will not be considered good
   word correspondences and will not be present in the output.

Usage: statistic.pl STATISTICS_LIBRARY DEST SOURCE

Here STATISTICS_LIBRARY can be any of the tests of association, such as
mi, dice, ll (log-likelihood), etc. SOURCE is the file created by
kvec.pl. DEST is the output file.

***NOTE***
To run statistic.pl the user should have BSP in the path. For example,
we have an entry in the .cshrc file

/home/cs/varm0003/BSP

BSP is the directory where the statistic.pl program and the Perl modules
for all the tests of association are present.
**********

For details about the program statistic.pl please refer to the BSP
README [2].

6.1 Example
-----------

Consider the example and the contingency table we created in the K-vec
algorithm description. For the word pair fisheries->peches we will have
the following values

10
fisheries<>peches<>2 2 2

Suppose this is given as input to statistic.pl with mi as the
STATISTICS_LIBRARY; the output will be

10
fisheries<>peches<>1 2.3219 2 2 2

Here 10 is the total number of pieces. On the second line we have
fisheries (word from the SOURCE text), peches (word from the TARGET
text), 1 (rank) and 2.3219 (the MI value for this pair); the remaining
values are a, (a+b), (a+c) (as described in section 5).

7. Tests/measures of association Perl modules
---------------------------------------------

We implemented various tests/measures of association, the T-score
(tscore.pm), the Odds ratio (odds.pm), the phi coefficient (phi.pm) and
the right-sided Fisher's test (rightFisher.pm), as a part of this
package.
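Each of these modules computes its measure from a completed contingency
table. As an illustration, here is a sketch in Python (the package's
modules are Perl) of the standard phi coefficient formula that a module
like phi.pm implements, applied to the fisheries/peches table from
section 4.3 (a=2, b=0, c=0, d=8):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient of a two by two contingency table, using the
    standard definition: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

# fisheries/peches table from section 4.3: the two occurrence vectors
# are identical, so the words are perfectly associated and phi is 1.0.
print(phi_coefficient(2, 0, 0, 8))  # -> 1.0
```

The other modules differ only in the formula applied to the same four
cell counts.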
The Bigram Statistics Package [2] allows the user to write his or her
own statistic module and plug it into statistic.pl. For instance, to use
the T-score as the test of association in statistic.pl, use the
following command

statistic.pl tscore.pm eng.french.tscore eng.french

where tscore.pm is the STATISTICS_LIBRARY, eng.french.tscore is the
output file, and eng.french is the input file (created by kvec.pl).

In the K-vec algorithm, Fung and Church [1] prefer to use the t-score
over mi. This is because mi is unreliable for words with low frequencies
and may rank such words very high.

8. "ensemble.pl"
----------------

ensemble.pl is a Perl program to combine the output of different files
produced by statistic.pl. Instead of considering all the word pairs
produced by statistic.pl as word correspondences, ensemble.pl considers
the words in a pair to be translations of each other only if the pair
receives a specified number of votes (i.e. is present in that many
output files).

Usage: perl ensemble.pl [file1 file2 {file3, file4...}] {--help|--version|--votes}

Here, file1, file2, ... are the input files, i.e. output files of
statistic.pl.

ensemble.pl has the following command line options

a. --version - Displays the version number and copyright information.

b. --help - Displays a description of the various command-line options.

c. --votes N - This is the minimum number of input files (for
   ensemble.pl) that a word pair has to be present in to be considered a
   word correspondence. By default, the value of votes is 1, and the
   value cannot be greater than the number of input files.

Example

perl ensemble.pl --votes 2 file1 file2 file3

In the above case a word pair will be considered a valid word
correspondence, and will be present in the output of ensemble.pl, only
if it is present in at least 2 of the 3 input files. For example,
suppose the word pair "fisheries peches" occurs in the required number
of the output files.
Then there is an entry for the word pair as follows

fisheries<>peches

We kept the output format similar to the format of the statistic.pl
program. We can thus use the same evaluation program, eval.pl, described
below to find the precision, recall and F-measure values.

9. eval.pl
----------

eval.pl is a Perl program to compare the output of statistic.pl or
ensemble.pl with the gold standard data and compute the precision,
recall and F-measure values.

Usage: eval.pl [ { --help | --version } DATAFILE1 DATAFILE2 ]

DATAFILE1 is the GOLD_DATA. DATAFILE2 is the OUTPUT of the statistic.pl
or ensemble.pl program.

10. Test Run
------------

This section describes the entire procedure for finding word
correspondences using the kvec.pl and statistic.pl programs.

Run kvec.pl on the parallel text provided with the package

perl kvec.pl --lcutoff 3 --ucutoff 11 French.txt English.txt

With the above command both texts will be divided into a number of
pieces equal to the square root of the total number of words in the
source text. Only the word pairs with frequencies between 3 and 11 will
be considered valid word pairs.

***Run statistic.pl with different tests and command-line options***

statistic.pl --rank 25 dice eng.french.dice eng.french

With the above command statistic.pl will rank the word correspondences
according to their dice values, but only the word correspondences with a
rank less than or equal to 25 will be considered valid translation
pairs.

statistic.pl --frequency 3 mi eng.french.mi eng.french

With the above command statistic.pl will rank the word correspondences
according to their mi values, but only the word pairs whose words occur
together in three or more corresponding pieces will be considered valid
translation pairs.
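The precision, recall and F-measure that eval.pl (section 9) reports are
the standard definitions: the fraction of proposed pairs that are in the
gold data, the fraction of gold pairs that were proposed, and their
harmonic mean. A sketch in Python (the word pairs below are toy data for
illustration, not the package's actual gold standard):

```python
def evaluate(output_pairs, gold_pairs):
    """Standard precision, recall and F-measure of proposed word
    pairs against a gold standard (set-based sketch)."""
    output, gold = set(output_pairs), set(gold_pairs)
    correct = len(output & gold)
    precision = correct / len(output)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy illustration: 2 of the 3 proposed pairs appear in the gold data.
gold  = [("fisheries", "peches"), ("health", "sante"), ("act", "loi")]
found = [("fisheries", "peches"), ("health", "sante"), ("king", "roi")]
p, r, f = evaluate(found, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # -> 0.67 0.67 0.67
```

With equal-sized output and gold sets, precision and recall coincide, so
the F-measure equals both.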
statistic.pl --score 10 ll eng.french.ll eng.french

With the above command statistic.pl will rank the word correspondences
according to their log-likelihood values, but only the word
correspondences with values greater than or equal to 10 will be
considered valid translations.

One can try all possible combinations of these tests and command-line
options.

Once the output files have been created using statistic.pl, ensemble.pl
can be used to combine the word pairs present in these output files,
depending on whether or not they are present in the specified number of
input files.

perl ensemble.pl --votes 3 eng.french.ll eng.french.mi eng.french.dice

Here the files eng.french.ll, eng.french.mi and eng.french.dice are the
output files of statistic.pl for the Log-likelihood ratio, Pointwise
Mutual Information and the Dice coefficient respectively. These files
are given as input to ensemble.pl. Votes is the number of input files a
word pair has to be present in to be considered a valid translation pair
and be present in the output of ensemble.pl. In the above example, a
word pair needs to be present in all 3 of the input files to be
considered a valid translation pair, i.e. appear in the output of
ensemble.pl.

11. System Requirements
-----------------------

1. Perl (v5.6.0). This package is written in Perl. Perl is freely
   available from http://www.perl.org

2. BSP (Bigram Statistics Package, version 0.5 (BETA); strict
   requirement). This package is also written in Perl. BSP is freely
   available from http://www.d.umn.edu/~tpederse/code.html

12. Copying
------------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

13. Acknowledgment
------------------

We would like to thank P. Fung and K. W. Church for the examples and the
description of the K-vec algorithm, which made our task much easier.

14. References
--------------

1. Fung, P. and K. W. Church. (1994) "K-vec: A New Approach for Aligning
   Parallel Texts." Proceedings of the 15th International Conference on
   Computational Linguistics, Kyoto.

2. Bigram Statistics Package by Ted Pedersen and Satanjeev Banerjee.
   http://www.d.umn.edu/~tpederse/code.html