K-vec++: Approach For Finding Word Correspondences
====================================================

Version 0.2

Copyright (C) 2001-2002
Ted Pedersen, tpederse@d.umn.edu
Nitin Varma, varm0003@d.umn.edu
University Of Minnesota, Duluth

1. Introduction
----------------

This package is an implementation of the part of the K-vec algorithm [1]
that finds word correspondences. Word correspondences are word pairs or
phrases that are exact translations of each other from one language to
another. The algorithm finds these word pairs in a parallel text, which
is a document in one language along with its translation in another
language. This package also includes a program to combine the output of
different techniques and yield results better than any of the individual
methods. Also included is a program to evaluate the quality of the
lexicon created.

Let the English text "Canadian Institute of Health Research Act." and
the French text "Loi sur les Instituts de recherche en sante du Canada"
form an English-French parallel text. The word "health" from the English
text and the word "sante" from the French text form a word
correspondence.

This README describes the general characteristics of the algorithm and
provides a detailed description of it with the help of examples. It also
provides a detailed description of the other programs in this package.

2. General characteristics of the K-vec Algorithm
-------------------------------------------------

1. The K-vec algorithm is based on the observation that if two words are
   translations of each other, then they occur almost an equal number of
   times and in approximately the same regions of the parallel text.

2. It does not require any prior knowledge of the punctuation or the
   sentence boundaries of the languages under consideration. Hence, it
   is not tied to a particular language pair and may work for any
   language pair where the words have been segmented.

3. K-vec deals only with word-to-word translations and cannot find
   phrasal translations.
   Phrasal translations are cases where a word or group of words in one
   language is translated to a word or group of words in the other
   language.

3. Background concepts
------------------------

In this section we describe some of the terms we will be using in the
description of the K-vec algorithm.

3.1 Two by two contingency tables
---------------------------------

Contingency tables are a simple way of representing data. Consider an
example where you want to categorize students depending on whether or
not they have registered for physics and chemistry courses. Suppose the
sample under consideration consists of 10 students. Let the number of
students registered for both physics and chemistry, only for physics,
and only for chemistry be 3, 2, and 1 respectively. This data can be
represented in a two by two matrix, called a contingency table, as
follows.

                                 Y
                  +------------+------------+-----------+
                  |  physics   |  !physics  |  totals   |
   +--------------+------------+------------+-----------+
 X | chemistry    |    a=3     |    c=1     |  a+c=4    |
   | !chemistry   |    b=2     |    d=4     |  b+d=6    |
   +--------------+------------+------------+-----------+
   | totals       |   a+b=5    |   c+d=5    | T=a+b+c+d |
   +--------------+------------+------------+-----------+

Cell 'a' represents the number of students registered for both courses.
Cells 'b' and 'c' represent the number of students registered only for
physics and only for chemistry respectively. Representing data in this
form helps to extract additional information. The first row represents
the total number of students who have registered for the chemistry
course, and the second row represents the total number of students who
have not. Similarly, the first column represents the total number of
students who have registered for the physics course, and the second
column represents the total number of students who have not.
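As a minimal sketch in Python (the package itself is written in Perl),
the cells and marginal totals of this table relate as follows; the
counts 3, 2, 1 and the sample size of 10 are taken straight from the
example above:

```python
# Contingency table cells for the student example in the text:
# a = both courses, b = only physics, c = only chemistry.
a, b, c = 3, 2, 1
T = 10                    # total number of students
d = T - (a + b + c)       # registered for neither course

# Marginal totals, as read off the rows and columns of the table.
physics_total   = a + b   # first column: all physics students
chemistry_total = a + c   # first row:    all chemistry students

print(physics_total, chemistry_total, d)  # -> 5 4 4
```

The same additive relationships are what K-vec exploits later, when the
cells count pieces of text rather than students.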
3.2 Tests of association
--------------------------

Tests of association are statistical tests for measuring the similarity
between two entities. Once a contingency table is created, several tests
of association such as Pointwise Mutual Information, the T-score, the
Log-likelihood ratio, the Dice coefficient, the Odds ratio, the
right-sided Fisher's test, etc. can be used to decide whether or not the
two variables in the contingency table are related, depending on the
scores produced by these tests. Fung and Church use the T-score and
Pointwise Mutual Information as their measures, which are explained
below.

3.2.1 Pointwise Mutual Information
----------------------------------

Pointwise Mutual Information is the logarithm of the ratio of the
probability of the two events X and Y occurring together to the product
of the probabilities of the two events occurring independently. This is
a direct comparison of what is observed to what would be expected if the
two events were independent. If these probabilities are the same then
the two events are independent (and the PMI score will be 0.0).

PMI = log2 ( prob(X=chemistry, Y=physics) /
             ( prob(X=chemistry) prob(Y=physics) ) )

If PMI = 0, the two variables are independent. If PMI > 0, the two
variables are related: the more the observed probability exceeds the
product of the two unconditional probabilities, the stronger the
evidence for dependence. Since there is no exact point at which we cross
the line from independence to dependence, the interpretation of this
test, and of most of the others, is something of an art.

3.2.2 T-score
-------------

The T-score is defined as the ratio of the difference between the
observed and the expected mean to the standard deviation of the sample.
Note that this is a variant of the standard t-test that was proposed for
use in the identification of collocations in large samples of text.
t-score = prob(X=chemistry, Y=physics) - ( prob(X=chemistry) prob(Y=physics) )
          --------------------------------------------------------------------
                     sqrt( (1/T) prob(X=chemistry, Y=physics) )

4. The K-vec Algorithm
----------------------

1. The K-vec algorithm divides the two texts into k pieces each. A piece
   then consists of a certain number of words from the text.

2. For each word in both texts it checks whether the word occurs in a
   piece, and represents the word's distribution in the form of a
   k-dimensional binary vector (k is the number of pieces). The
   occurrence of a word in a particular piece is indicated by a value of
   '1' in the binary vector. A value of '0' indicates that there was not
   a single occurrence of that word in the piece. The frequency of a
   word within a particular piece is not considered.

   Consider an example where an English-French parallel text is divided
   into 10 pieces. If an English word "king" occurs 5 times in piece 2
   and 7 times in piece 8, then the binary vector for the word "king"
   will be

   V(king) = <0, 1, 0, 0, 0, 0, 0, 1, 0, 0>

3. It forms a two by two contingency table for each word pair, where the
   two words come from the texts of the two languages under
   consideration. For example, consider the word pair "king"->"roi",
   where "king" is a word from the English text and "roi" is a word from
   the French text.

                                Y
                 +------------+------------+-----------+
                 |    roi     |    !roi    |  totals   |
   +-------------+------------+------------+-----------+
 X | king        |     a      |     c      |   a+c     |
   | !king       |     b      |     d      |   b+d     |
   +-------------+------------+------------+-----------+
   | totals      |    a+b     |    c+d     | T=a+b+c+d |
   +-------------+------------+------------+-----------+

   Cell 'a' is filled by counting the total number of times a 1 occurs
   in corresponding positions of the vectors of the two words under
   consideration. It represents the number of times a piece contains the
   word "king" and the corresponding piece contains the word "roi".
   Cell 'a+c' is filled by counting the total number of times a 1 occurs
   in the vector for the word "king". It represents the total number of
   pieces containing the word "king". Cell 'c' is filled by subtracting
   the value in cell 'a' from the value in cell 'a+c'. It represents the
   number of times a piece contains the word "king" and the
   corresponding piece does not contain the word "roi".

   Cell 'a+b' is filled by counting the number of times a 1 occurs in
   the vector for the word "roi". It represents the total number of
   pieces containing the word "roi". Cell 'b' is filled by subtracting
   the value in cell 'a' from the value in cell 'a+b'. It represents the
   number of times a piece contains the word "roi" and the corresponding
   piece does not contain the word "king".

   Cells 'b+d' and 'c+d' are filled by counting the total number of
   times a 0 occurs in the vectors for the words "king" and "roi"
   respectively. 'b+d' represents the total number of pieces which do
   not contain the word "king". 'c+d' represents the total number of
   pieces which do not contain the word "roi". Cell 'd' is filled by
   subtracting the value in cell 'b' from the value in cell 'b+d'. It
   represents the number of times a piece does not contain the word
   "king" and the corresponding piece does not contain the word "roi".

   Cell 'a+b+c+d' is filled by adding the values of cells 'a', 'b', 'c'
   and 'd'. It represents the total number of pieces into which each of
   the texts is divided.

4. It then uses a test of association to find how closely related the
   two words are. K-vec uses Pointwise Mutual Information (PMI) and the
   T-score as tests of association. According to Fung and Church, PMI is
   not a reliable test of association when counts are small. Hence, they
   prefer the T-score over PMI.
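The four steps above can be sketched in Python (the package itself is
written in Perl). For concreteness the vectors below are the
fisheries/peches occurrence vectors from the worked example in section
4.3; everything else follows mechanically from the definitions just
given:

```python
import math

# Step 2: k-dimensional binary occurrence vectors, one bit per piece
# (here k = 10; both words occur in pieces 2 and 8).
v_eng = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]   # "fisheries"
v_fre = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]   # "peches"
T = len(v_eng)                            # total number of pieces

# Step 3: fill the two by two contingency table from the vectors.
a = sum(e & f for e, f in zip(v_eng, v_fre))  # 1 in both vectors
a_plus_c = sum(v_eng)        # pieces containing the English word
a_plus_b = sum(v_fre)        # pieces containing the French word
b = a_plus_b - a
c = a_plus_c - a
d = T - (a + b + c)

# Step 4: tests of association from observed/expected probabilities.
p_xy = a / T                 # prob(X, Y)
p_x  = a_plus_c / T          # prob(X)
p_y  = a_plus_b / T          # prob(Y)

pmi = math.log2(p_xy / (p_x * p_y))
tscore = (p_xy - p_x * p_y) / math.sqrt(p_xy / T)

print(round(pmi, 4), round(tscore, 4))  # -> 2.3219 1.1314
```

With identical vectors the table degenerates to b = c = 0, so both
measures report a strong association, as expected for a translation
pair.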
4.1 Number of pieces
--------------------

If the number of pieces into which the text is divided is very large,
then the total number of words in each piece is small, and a word and
its translation may not occur in corresponding pieces. K-vec may miss
such translations, as it looks for word translations in corresponding
pieces. If the number of pieces is very small, then the number of words
in each piece will be large, and the basic advantage of dividing the
text into pieces and looking for a word and its translation in
corresponding pieces is lost.

Fung and Church suggest that K-vec divide the text into a number of
pieces equal to the square root of the total number of word tokens in
the text. For very large data sets the number of word tokens in each
piece will therefore also be large. We thus believe that one does not
always get the best results by dividing the text into a number of pieces
equal to the square root of the total number of words in the text, and
we evaluate this in our experimental work.

4.2 Determining candidate translations
--------------------------------------

Fung and Church do not consider all the word pairs from the two texts as
possible or candidate translations, because there would be too many such
word pairs. They restrict the algorithm to word pairs with frequencies
between 3 and 11. They do not consider low frequency word pairs because
the amount of information about these words is not sufficient to find a
translation.

The words that make up high frequency word pairs in both languages of
the parallel text will occur in almost every piece. K-vec looks for word
correspondences in corresponding pieces, and therefore every high
frequency word in one language will be considered a translation of the
high frequency words in the other language. For example, if there are n
high frequency words in one language and m high frequency words in the
other language, then there will be almost nm word pairs formed using
these words and only n of these can be correct.
Measures of association will be unable to differentiate such words, and
so they are not considered for translation. While the idea of a
frequency cutoff is sound, we do not believe that an upper limit of 11
(used by Fung and Church) is always the correct choice; rather, it
should depend on the number of pieces. For a more detailed description
of the K-vec algorithm refer to [1].

4.3 Example for the K-vec Algorithm
------------------------------------

The following is an example from the K-vec paper [1] that shows how the
algorithm works. Consider a sample of English and French text, with the
English text having 1000 words and the French text having 900 words. It
was found that the word "fisheries" occurs 19 times in the English text
and the word "peches" occurs 21 times in the French text.

1. Consider the English and the French texts to be divided into 10
   pieces, such that each piece of the English text contains 100 words
   while each piece of the French text contains 90 words.

2. Out of the 19 occurrences of the word "fisheries" in the English
   text, it was found that it occurs 3 times in piece 2 and 16 times in
   piece 8. So the binary vector for the word "fisheries" is

   V(fisheries) = < 0, 1, 0, 0, 0, 0, 0, 1, 0, 0 >

   Out of the 21 occurrences of the word "peches" in the French text, it
   was found that it occurs 4 times in piece 2 and 17 times in piece 8.
   So the binary vector for the word "peches" is

   V(peches) = < 0, 1, 0, 0, 0, 0, 0, 1, 0, 0 >

3. Create a two by two contingency table for fisheries->peches.

                                 Y
                  +------------+------------+--------+
                  |   peches   |  !peches   | totals |
   +--------------+------------+------------+--------+
 X | fisheries    |    a=2     |    c=0     |   2    |
   | !fisheries   |    b=0     |    d=8     |   8    |
   +--------------+------------+------------+--------+
   | totals       |     2      |     8      |  T=10  |
   +--------------+------------+------------+--------+

4. Use different measures of association to find whether or not the
   words "fisheries" and "peches" form a good translation pair.
PMI = log2 ( prob(fisheries, peches) / ( prob(fisheries) prob(peches) ) )
    = log2 ( (0.2) / ( (0.2) (0.2) ) )
    = 2.3219

The PMI value of 2.3219 suggests that the words "fisheries" and "peches"
are translations of each other.

t-score = ( prob(fisheries, peches) - ( prob(fisheries) prob(peches) ) )
          ---------------------------------------------------------------
                     sqrt( (1/T) prob(fisheries, peches) )

        =   (0.2) - (0.2)(0.2)
          -----------------------
           sqrt( (1/10)(0.2) )

        = 1.1314

5. Program kvec.pl
------------------

kvec.pl is the Perl implementation of the K-vec algorithm.

Usage: perl kvec.pl [{[--pieces K --lcutoff N --ucutoff M] TARGET SOURCE | --help | --version}]

SOURCE is the text in one language. TARGET is the text in the other
language. This program takes as input two files which are translations
of each other in two different languages and outputs...

1. N, i.e. the total number of pieces.

2. For each word pair XY (where X is a word from SOURCE and Y is a word
   from TARGET), it outputs

   'a'   - The total number of times a piece of the SOURCE text contains
           the word X and the corresponding piece of the TARGET text
           contains the word Y.
   'a+b' - The total number of pieces containing the word X.
   'a+c' - The total number of pieces containing the word Y.

The format of the OUTPUT is as follows

N
word from SOURCE<>word from TARGET<> a a+b a+c
word from SOURCE<>word from TARGET<> a a+b a+c
word from SOURCE<>word from TARGET<> a a+b a+c
.
.
.
word from SOURCE<>word from TARGET<> a a+b a+c

kvec.pl also has the following command line options...

a. --lcutoff N - Allows the user to specify an integer value N, such
   that only the word pairs formed using words occurring N or more times
   in corresponding pieces are considered valid word pairs and are
   present in the output of K-vec. For example, if the user specifies
   the cutoff to be 7, all word pairs with frequency less than 7 will
   not be considered valid word pairs. By default, the lcutoff frequency
   is 4.

b.
--ucutoff M - Allows the user to specify an integer value M, such that
   only the word pairs formed using words occurring M or fewer times in
   corresponding pieces are considered valid pairs and are present in
   the output of K-vec. For example, if the user specifies the cutoff to
   be 10, all words with frequency more than 10 will not be considered
   valid word pairs. By default, all words above the lcutoff are
   considered.

c. --pieces K - Allows the user to specify how many pieces the source
   text and the target text should be divided into. By default, the
   number of pieces into which both the source text and the target text
   are divided is set to the square root of the total number of words in
   the source text.

d. --version - Displays the version number and copyright information.

e. --help - Displays a description of the various command-line options.

6. Program "statistic.pl"
-------------------------

statistic.pl is a Perl program from the Bigram Statistics Package (BSP)
by Ted Pedersen and Satanjeev Banerjee. statistic.pl can be used to
apply different tests of association and find relations between
different entities, which in our case are the words from different
languages. We use tests/measures such as Pointwise Mutual Information,
the Dice coefficient, the Log-likelihood ratio, etc. implemented in BSP.
Also, we use tests/measures such as the Odds ratio, the T-score and the
right-sided Fisher's test implemented as part of this package.

We use statistic.pl to apply these tests to the output of kvec.pl to
find the best word correspondences. statistic.pl ranks the word
correspondences depending on their value (calculated using the chosen
statistic). The higher a pair is ranked (rank 1 being the best), the
better the word correspondence.

statistic.pl has the following command-line options:

a. --frequency N - Allows the user to specify an integer value N, such
   that the statistical tests will be applied to a word pair XY only if
   the words occur together in at least N corresponding pieces.
   For example, if N is 5, then the statistical test will be applied
   only if the word X and the word Y occur together in 5 or more
   corresponding pieces.

b. --rank N - Allows the user to specify an integer value N, such that
   all word correspondences with a rank higher than N will not be
   considered good word correspondences and will not be present in the
   output.

c. --score N - Allows the user to specify a value N, such that all word
   correspondences with a score less than N will not be considered good
   word correspondences and will not be present in the output.

Usage: statistic.pl STATISTICS_LIBRARY DEST SOURCE

Here STATISTICS_LIBRARY can be any of the tests of association, such as
mi, dice, ll (log-likelihood), etc. SOURCE is the file created by
kvec.pl. DEST is the output file.

***NOTE***
To run statistic.pl the user should have BSP in the path. For example,
we have an entry in the .cshrc file

/home/cs/varm0003/BSP

BSP is the directory where the statistic.pl program and the Perl modules
for all the tests of association are present.
**********

For details about the program statistic.pl please refer to the BSP
README [2].

6.1 Example
-----------

Consider the example and the contingency table we created in the K-vec
algorithm description. For the word pair fisheries->peches we will have
the following values

10
fisheries<>peches<>2 2 2

Suppose this is given as input to statistic.pl with mi as the
STATISTICS_LIBRARY; the output will be

10
fisheries<>peches<>1 2.3219 2 2 2

Here 10 is the total number of pieces. On the second line we have
fisheries (word from the SOURCE text), peches (word from the TARGET
text), 1 (rank) and 2.3219 (the MI value for this pair); the remaining
values are a, (a+b), (a+c) (as described in section 5).

7. Tests/measures of association Perl modules
---------------------------------------------

We implemented various tests/measures of association, the T-score
(tscore.pm), the Odds ratio (odds.pm), the phi coefficient (phi.pm) and
the right-sided Fisher's test (rightFisher.pm), as a part of this
package.
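Each of these modules computes its measure from a completed contingency
table. As an illustration, here is a sketch in Python (the package's
modules are Perl) of the standard phi coefficient formula that a module
like phi.pm implements, applied to the fisheries/peches table from
section 4.3 (a=2, b=0, c=0, d=8):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient of a two by two contingency table, using the
    standard definition: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

# fisheries/peches table from section 4.3: the two occurrence vectors
# are identical, so the words are perfectly associated and phi is 1.0.
print(phi_coefficient(2, 0, 0, 8))  # -> 1.0
```

The other modules differ only in the formula applied to the same four
cell counts.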
The Bigram Statistics Package [2] allows the user to write his or her
own statistic module and plug it into statistic.pl. For instance, to use
the T-score as the test of association in statistic.pl, use the
following command

statistic.pl tscore.pm eng.french.tscore eng.french

where tscore.pm is the STATISTICS_LIBRARY, eng.french.tscore is the
output file, and eng.french is the input file (created by kvec.pl).

In the K-vec algorithm, Fung and Church [1] prefer to use the t-score
over mi. This is because mi is unreliable for words with low frequencies
and may rank such words very high.

8. "ensemble.pl"
----------------

ensemble.pl is a Perl program to combine the output of different files
produced by statistic.pl. Instead of considering all the word pairs
produced by statistic.pl as word correspondences, ensemble.pl considers
the words in a pair to be translations of each other only if the pair
receives a specified number of votes (i.e. is present in that many
output files).

Usage: perl ensemble.pl [file1 file2 {file3, file4...}] {--help|--version|--votes}

Here, file1, file2, ... are the input files, i.e. output files of
statistic.pl.

ensemble.pl has the following command line options

a. --version - Displays the version number and copyright information.

b. --help - Displays a description of the various command-line options.

c. --votes N - This is the minimum number of input files (for
   ensemble.pl) that a word pair has to be present in to be considered a
   word correspondence. By default, the value of votes is 1, and the
   value cannot be greater than the number of input files.

Example

perl ensemble.pl --votes 2 file1 file2 file3

In the above case a word pair will be considered a valid word
correspondence, and will be present in the output of ensemble.pl, only
if it is present in at least 2 of the 3 input files. For example,
suppose the word pair "fisheries peches" occurs in the required number
of the output files.
Then there is an entry for the word pair as follows

fisheries<>peches

We kept the output format similar to the format of the statistic.pl
program. We can thus use the same evaluation program, eval.pl, described
below to find the precision, recall and F-measure values.

9. eval.pl
----------

eval.pl is a Perl program to compare the output of statistic.pl or
ensemble.pl with the gold standard data and compute the precision,
recall and F-measure values.

Usage: eval.pl [ { --help | --version } DATAFILE1 DATAFILE2 ]

DATAFILE1 is the GOLD_DATA. DATAFILE2 is the OUTPUT of the statistic.pl
or ensemble.pl program.

10. Test Run
------------

This section describes the entire procedure for finding word
correspondences using the kvec.pl and statistic.pl programs.

Run kvec.pl on the parallel text provided with the package

perl kvec.pl --lcutoff 3 --ucutoff 11 French.txt English.txt

With the above command both texts will be divided into a number of
pieces equal to the square root of the total number of words in the
source text. Only the word pairs with frequencies between 3 and 11 will
be considered valid word pairs.

***Run statistic.pl with different tests and command-line options***

statistic.pl --rank 25 dice eng.french.dice eng.french

With the above command statistic.pl will rank the word correspondences
according to their dice values, but only the word correspondences with a
rank less than or equal to 25 will be considered valid translation
pairs.

statistic.pl --frequency 3 mi eng.french.mi eng.french

With the above command statistic.pl will rank the word correspondences
according to their mi values, but only the word pairs whose words occur
together in three or more corresponding pieces will be considered valid
translation pairs.
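The precision, recall and F-measure that eval.pl (section 9) reports are
the standard definitions: the fraction of proposed pairs that are in the
gold data, the fraction of gold pairs that were proposed, and their
harmonic mean. A sketch in Python (the word pairs below are toy data for
illustration, not the package's actual gold standard):

```python
def evaluate(output_pairs, gold_pairs):
    """Standard precision, recall and F-measure of proposed word
    pairs against a gold standard (set-based sketch)."""
    output, gold = set(output_pairs), set(gold_pairs)
    correct = len(output & gold)
    precision = correct / len(output)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy illustration: 2 of the 3 proposed pairs appear in the gold data.
gold  = [("fisheries", "peches"), ("health", "sante"), ("act", "loi")]
found = [("fisheries", "peches"), ("health", "sante"), ("king", "roi")]
p, r, f = evaluate(found, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # -> 0.67 0.67 0.67
```

With equal-sized output and gold sets, precision and recall coincide, so
the F-measure equals both.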
statistic.pl --score 10 ll eng.french.ll eng.french

With the above command statistic.pl will rank the word correspondences
according to their log-likelihood values, but only the word
correspondences with values greater than or equal to 10 will be
considered valid translations.

One can try all possible combinations of these tests and command-line
options.

Once the output files have been created using statistic.pl, ensemble.pl
can be used to combine the word pairs present in these output files,
depending on whether or not they are present in the specified number of
input files.

perl ensemble.pl --votes 3 eng.french.ll eng.french.mi eng.french.dice

Here the files eng.french.ll, eng.french.mi and eng.french.dice are the
output files of statistic.pl for the Log-likelihood ratio, Pointwise
Mutual Information and the Dice coefficient respectively. These files
are given as input to ensemble.pl. Votes is the number of input files a
word pair has to be present in to be considered a valid translation pair
and be present in the output of ensemble.pl. In the above example, a
word pair needs to be present in all 3 of the input files to be
considered a valid translation pair, i.e. appear in the output of
ensemble.pl.

11. System Requirements
-----------------------

1. Perl (v5.6.0). This package is written in Perl. Perl is freely
   available from http://www.perl.org

2. BSP (Bigram Statistics Package, version 0.5 (BETA); strict
   requirement). This package is also written in Perl. BSP is freely
   available from http://www.d.umn.edu/~tpederse/code.html

12. Copying
------------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

13. Acknowledgment
------------------

We would like to thank P. Fung and K. W. Church for the examples and the
description of the K-vec algorithm, which made our task much easier.

14. References
--------------

1. Fung, P. and K. W. Church. (1994) "K-vec: A New Approach for Aligning
   Parallel Texts." Proceedings of the 15th International Conference on
   Computational Linguistics, Kyoto.

2. Bigram Statistics Package by Ted Pedersen and Satanjeev Banerjee.
   http://www.d.umn.edu/~tpederse/code.html