			         Blinker Gold Data Package   
			   ======================================
			                  Version 0.2	
			    Copyright (C) 2001-2002
 
			      Ted Pedersen, tpederse@d.umn.edu 
                               Nitin Varma, varm0003@d.umn.edu 
                               University Of Minnesota, Duluth
                       


1. Introduction
---------------

The parallel text used by the K-vec algorithm to compile a bilingual
lexicon is obtained from the Blinker data [2]. The Blinker data
consists of English and French versions of the online Bible text. The
two texts are not direct translations of each other; each is a
translation of a different, older version of the Bible. The data is
manually annotated, i.e. human annotators have linked the words in the
English text to their translations in the French text. Manual
annotation is very time consuming, so such data is rare. We decided to
use the Blinker data because it is free and easily available online.
Most importantly, because the data is manually annotated, it is a
reliable source from which to compile the gold standard data: a
lexicon that has translations for all the words in the parallel text.
Many standard dictionaries are available for the English-French
language pair, but they are not comprehensive and therefore may not
have translations for all the words of the parallel text. Also, a word
may have different translations depending on the context in which it
is used. Standard dictionaries may lack some of these translations, or
may list translations that are never used in the parallel text. The
gold standard lexicon is compiled from the parallel text itself and
thus has translations for all the words, in the context in which they
are used.

This README describes the Blinker data in detail and the approach we
took to compile the gold standard bilingual lexicon. The Blinker data
can be obtained from http://www.cs.nyu.edu/~melamed/datasets.html and
needs to be saved and untarred in a directory called "Blinker-gold".

2. Blinker Data
---------------

The main objective of the Blinker project by I. Dan Melamed was to
create a gold standard lexicon that can be used to evaluate the
lexicons produced by different algorithms for compiling bilingual
lexicons. The first and foremost requirement for creating a gold
standard lexicon is that it be based on actual parallel text. The
Blinker project is based on an English-French version of the Bible
because the text is freely available and in the public domain.

According to Melamed [2], out of the 66 books of the Bible,
Ecclesiastes, Hosea and Job are not well understood and have
inconsistent translations, and were therefore not included in the
Blinker project data. The remaining 63 books comprise 29614 verses, of
which 250 were selected to be manually aligned. Manually aligning the
entire Bible would be too time consuming, so only a portion was
aligned. The following procedure was used to select the verses:


1. Pre-process both the English and the French verses: separate
punctuation from the words it is adjacent to, and split hyphenated
words into multiple words. The resulting parallel text had 814451
tokens (total number of words) and 14817 types (total number of
distinct words) in the English half, and 896717 tokens and 21372 types
in the French half.

2. Count the frequency of each word type in the verses.

3. Randomly select 25 word types for each of the frequencies one, two,
three and four (for a total of 100 types).

4. A verse is selected if any one of the selected word types occurs
in it. If a selected verse has more than one occurrence of the same
word type, then that word type is replaced by another word type of the
same frequency. Also, if two different word types occur in the same
selected verse, then the word type with the lower frequency is replaced
by another word type of that frequency. The aim was to select exactly
one verse for each occurrence of a word type.

5. Repeat step 4 until the total number of verses selected is equal to
(1 + 2 + 3 + 4) x 25 = 250. For each word type, the number of verses
selected is equal to its frequency.
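
The counting part of the procedure above can be sketched in Perl as
follows. This is only an illustration, not the code used by the Blinker
project: the tokenization regex, the sub names and the toy verses are
all assumptions made for this example, and the replacement rule of
step 4 is not implemented.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Step 1 (assumed rule): detach punctuation and hyphens from words.
sub tokenize {
    my ($verse) = @_;
    $verse =~ s/([^\w\s])/ $1 /g;
    return grep { length } split /\s+/, $verse;
}

# Step 2: count how often each word type occurs in all verses.
sub type_frequencies {
    my (@verses) = @_;
    my %freq;
    $freq{$_}++ for map { tokenize($_) } @verses;
    return \%freq;
}

# All word types that occur exactly $n times, in sorted order.
sub types_with_frequency {
    my ($freq, $n) = @_;
    my @types  = grep { $freq->{$_} == $n } keys %$freq;
    my @sorted = sort @types;
    return @sorted;
}

# Step 3: draw at most $count random types of frequency $n.
sub sample_types {
    my ($freq, $n, $count) = @_;
    my @candidates = shuffle(types_with_frequency($freq, $n));
    splice @candidates, $count if @candidates > $count;
    return @candidates;
}

# Step 5: each sampled type of frequency f contributes f verses.
sub verses_needed {
    my ($per_freq, $max_freq) = @_;
    my $total = 0;
    $total += $_ * $per_freq for 1 .. $max_freq;
    return $total;
}

# Toy input, not real Bible text.
my $freq = type_frequencies('a b-c , a d', 'd e');
print join(' ', types_with_frequency($freq, 2)), "\n";
print verses_needed(25, 4), "\n";
```

With 25 types per frequency and frequencies 1 to 4, verses_needed
reproduces the total of 250 verses stated in step 5.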


These 250 verses, in their English and French versions, together form
the parallel text for the Blinker data. It has 7510 tokens and 1714
types in the English half, and 8191 tokens and 1912 types in the
French half.

3. Description of the annotation files
--------------------------------------

In all, seven translators were recruited to manually align the
selected verses. Since they annotate the data with alignment
information, the translators are usually referred to as annotators.

The 250 verses selected above were divided into two parts. Part 1
contained verses 1-100 and was annotated by annotators 1, 2, 3, 4
and 5. Part 2 contained verses 101-250 and was annotated by
annotators 1, 2, 3, 6 and 7. Each verse was therefore annotated by
five different annotators. For each verse, each annotator produced an
output file containing a set of pairs of numbers: the position of a
word in the English verse and the position of its translation in the
corresponding French verse. The file annotations.ps shows an example
in which different annotators link the words of an English verse to
the words of the corresponding French verse. The annotations shown are
for verse 12 of the Blinker data.

The output file of annotator 1 for the above verse looks like this:
0 ->  1
1 ->  2
2 ->  3
6 ->  4
3 ->  5
4 ->  5
5 ->  6
7 ->  7
8 ->  8
9 ->  9
10 ->  10
11 ->  10
12  -> 11
13 ->  12
14 ->  13
14 ->  14
15 ->  15

For a word at a particular position, the number of entries in the
output file is equal to the number of words to which it is
translated. In the figure annotations.ps the word at position 14 is
translated to the words at positions 13 and 14 and thus has two
entries in the output file as follows
14 -> 13
14 -> 14

The above is an instance of phrasal translation: a translation in
which a group of one or more words in one language corresponds to a
group of one or more words in the other language, with more than one
word on at least one side.

There is an entry in the output file for each occurrence of a word
type. For example, the word type 'the' occurs at positions 1 and 8 and
is translated to the same word 'les' at positions 2 and 8
respectively, so there are two entries in the output file as follows
1 -> 2
8 -> 8

A word in the English or French verse that has no translation in the
corresponding French or English verse respectively, for example the
word 'alors' in the French verse in figure annotations.ps, is
translated as NULL. This is indicated in the figure by the absence of
a link for the word, and is represented in the output file by position
0 on the other side, as follows
0 -> 1

The output file thus has an entry for each word position in the
English verse.
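
As an illustration of this file format, the following Perl sketch reads
lines of the form "N -> M" and groups the French positions by English
position. The sub names are hypothetical and not part of this package,
and treating position 0 as NULL is an assumption based on the
"0 -> 1" example above.

```perl
use strict;
use warnings;

# Parse alignment lines such as "14 ->  13" into [english, french]
# pairs. Lines that do not match the pattern are skipped.
sub parse_alignment {
    my (@lines) = @_;
    my @links;
    for my $line (@lines) {
        next unless $line =~ /^\s*(\d+)\s*->\s*(\d+)\s*$/;
        push @links, [ $1, $2 ];
    }
    return @links;
}

# Collect, for each English position, the French positions linked to it.
sub french_positions_by_english {
    my (@links) = @_;
    my %targets;
    push @{ $targets{ $_->[0] } }, $_->[1] for @links;
    return \%targets;
}

# A few lines taken from the annotator 1 example above.
my @links   = parse_alignment('0 ->  1', '14 ->  13', '14 ->  14', '12  -> 11');
my $targets = french_positions_by_english(@links);
for my $en (sort { $a <=> $b } keys %$targets) {
    my $label = $en == 0 ? 'NULL' : $en;    # assumption: 0 stands for NULL
    print "$label -> @{ $targets->{$en} }\n";
}
```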

The "*.open" files for each annotator have the same format as the
output files above, except that they contain only the annotations for
content words. The gold standard data can therefore be compiled either
for all the word types or just for content words.


4.1 Program "consensus.pl"
--------------------------

"consensus.pl" is a Perl program that selects only those annotations
that are present in at least a majority of the output files of a group
of selected annotators. For example, consider a group of three
annotators 1, 2 and 3 and their output files for the same verse.

Let the output file for annotator 1 be
1  -> 2
1  -> 3
2  -> 4

Let the output file for annotator 2 be
1  -> 2
2  -> 4

Let the output file for annotator 3 be
1  -> 2
1  -> 3
2  -> 4

As there are three annotators, an annotation needs to be present in at
least 2 files to be included in the output of consensus.pl. Therefore,
we get the following output file for the above verse

2  -> 4 (present in all 3 files, i.e. annotators 1, 2, 3)
1  -> 2 3 (1 -> 2 is in all 3 files, 1 -> 3 in the files of annotators 1 and 3)

So the word at position 2 is linked to the word at position 4, as all
three annotators recommended it. The word at position 1 is translated
to the words at positions 2 and 3 (recommended by annotators 1 and 3)
and not just to the word at position 2 (recommended by annotator 2).
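
The majority rule described above can be sketched as follows. This is
not the code of consensus.pl; the sub name majority_links is
hypothetical, and the threshold int(n/2) + 1 is an assumption for
groups with an even number of annotators, which the example above does
not cover.

```perl
use strict;
use warnings;

# Keep every link that occurs in a majority of the annotators' files.
# Each annotator is given as an array ref of [english, french] pairs.
sub majority_links {
    my (@annotators) = @_;
    my $count  = @annotators;
    my $needed = int($count / 2) + 1;    # 2 of 3, 3 of 5
    my %votes;
    for my $links (@annotators) {
        my %seen;                        # count a link once per annotator
        for my $link (@$links) {
            my $key = "$link->[0] $link->[1]";
            $votes{$key}++ unless $seen{$key}++;
        }
    }
    my %consensus;
    for my $key (keys %votes) {
        next if $votes{$key} < $needed;
        my ($en, $fr) = split ' ', $key;
        push @{ $consensus{$en} }, $fr;
    }
    # Sort the French positions numerically for a stable result.
    for my $en (keys %consensus) {
        $consensus{$en} = [ sort { $a <=> $b } @{ $consensus{$en} } ];
    }
    return \%consensus;
}

# The three annotators from the example above.
my $result = majority_links(
    [ [1, 2], [1, 3], [2, 4] ],
    [ [1, 2], [2, 4] ],
    [ [1, 2], [1, 3], [2, 4] ],
);
for my $en (sort { $a <=> $b } keys %$result) {
    print "$en  -> @{ $result->{$en} }\n";
}
```

For the three example files this yields position 1 linked to 2 and 3,
and position 2 linked to 4, as in the consensus output shown above.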

Usage
consensus.pl [{[--content] BLINKER_DIR OUT_DIR ANNOTATOR_LIST |--help|--version}]
Here,
BLINKER_DIR - the path of the directory where the Blinker data is
installed and where the directories for all the annotators are
present. The path should be given within double quotes.

OUT_DIR - the directory where the consensus files are stored, i.e.
the output files of consensus.pl for each verse.

ANNOTATOR_LIST - the group of annotators among which we find a
consensus. For example, to find the consensus amongst annotators 1, 2
and 3, give the annotator numbers separated by " " as input.

consensus.pl has the following command-line options

--content - if this option is used, consensus.pl uses the "*.open"
files as input, i.e. the files containing only the annotations for the
content words. The gold data created will then cover only the content
words.

--help - prints the help message

--version - prints the version and copyright information.


4.2 Program "token.pl"
----------------------
"token.pl" is the Perl program for creating word-token based gold data
from the files created by the program "consensus.pl". It takes each
output file of consensus.pl and modifies the entries. If two entries
in a file (i.e. words at two different positions) are translated to
the word at the same position, they are merged as shown in the example
below
 
10 -> 10
11 -> 10
are modified to 
10  11 -> 10

As another example,
2 -> 3 4
3 -> 3 4
are modified to
2 3 -> 3 4

The word positions are then replaced by the actual words from the
corresponding verses to form the token based gold data, for example

the ->  les
flood ->  eaux
us ->  nous
would  have ->  auraient
engulfed ->  submerges
, ->  ,
the ->  les
torrent ->  torrent
would  have ->  auraient
swept  -> passe
over  ->  sur
us ->  notre  ame
, ->  ;
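
The two steps described above (merging entries that share the same
target positions, then replacing positions by words) can be sketched as
follows. This is not the code of token.pl. The sub names are
hypothetical, and the word lists are partial verses inferred from the
positions quoted in this README. Positions are assumed to be 1-based,
with 0 standing for NULL.

```perl
use strict;
use warnings;

# Merge entries with identical target positions, so that "10 -> 10"
# and "11 -> 10" become "10 11 -> 10". Each entry is a pair
# [source_position, "target positions"], given in file order.
sub merge_sources {
    my (@entries) = @_;
    my (%sources_for, @order);
    for my $entry (@entries) {
        my ($src, $tgt) = @$entry;
        push @order, $tgt unless exists $sources_for{$tgt};
        push @{ $sources_for{$tgt} }, $src;
    }
    return map { [ "@{ $sources_for{$_} }", $_ ] } @order;
}

# Replace 1-based positions by the words of a tokenized verse.
# Assumption: position 0 means "no translation" and is shown as NULL.
sub positions_to_words {
    my ($positions, $words) = @_;
    return join ' ',
        map { $_ == 0 ? 'NULL' : $words->[ $_ - 1 ] } split ' ', $positions;
}

# Illustrative start of the verse, inferred from the examples above.
my @english = qw(the flood would have engulfed us);
my @french  = qw(alors les eaux nous auraient submerges);

for my $entry (merge_sources([ 1, '2' ], [ 3, '5' ], [ 4, '5' ])) {
    my ($src, $tgt) = @$entry;
    print positions_to_words($src, \@english), ' -> ',
          positions_to_words($tgt, \@french), "\n";
}
```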

Usage:
token.pl [{SOURCE TARGET CONSENSUS_DIR | --help | --version}]

Here,
SOURCE is the file which contains all the English verses.

TARGET is the file which contains all the French verses.

CONSENSUS_DIR is the directory where the output files of consensus.pl
are present.

"token.pl" has the following command-line options

a. --version - Displays the version number and copyright information.

b. --help - Displays a description of the command-line options.


4.3 Program "type.pl"
---------------------
"type.pl" is the Perl program for creating type based gold data
from the token based gold data created by "token.pl".

Word-type gold data is prepared from the word-token gold data by
keeping a single entry for a word if it is translated to the same
word or words. For example, if the token based gold data consisted
only of the entries above, the two entries for the word type 'the' are
translated to the same word 'les', so there is a single entry in the
type based gold data. Similarly, there is a single entry for 'would
have' translated to 'auraient'. But if two or more entries in the
token based gold data for a word type are translated to different
words, we keep all of them in the type based gold data. For example,
there are 2 entries for ',' as it is translated to the different types
',' and ';'. The type based gold data is as follows

the ->  les
flood ->  eaux
us ->  nous
would  have ->  auraient
engulfed ->  submerges
, ->  ,
torrent ->  torrent
swept  -> passe
over  ->  sur
us ->  notre  ame
, ->  ;
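
The reduction from token based to type based entries amounts to
removing duplicate lines, as in the following sketch. This is not the
code of type.pl; the sub name unique_entries is hypothetical, and
normalizing the whitespace before comparing entries is an assumption.

```perl
use strict;
use warnings;

# Keep each "source -> target" entry once, in order of first
# appearance. Whitespace is normalized before comparison so that
# entries differing only in spacing count as the same entry.
sub unique_entries {
    my (@lines) = @_;
    my (%seen, @unique);
    for my $line (@lines) {
        my ($src, $tgt) = map { join ' ', split ' ', $_ } split /->/, $line, 2;
        next unless defined $tgt;    # skip lines without "->"
        my $entry = "$src -> $tgt";
        push @unique, $entry unless $seen{$entry}++;
    }
    return @unique;
}

# A few of the token based entries from the example above.
my @token_gold = (
    'the ->  les', 'flood ->  eaux', ', ->  ,',
    'the ->  les', ', ->  ;',
);
print "$_\n" for unique_entries(@token_gold);
```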



Usage: 
type.pl [{INPUT OUTPUT | --help | --version}]

Here,
INPUT is the file which contains the token based gold data.
OUTPUT is the name of the file to which the type based gold data will
be written.

"type.pl" has the following command-line options

a. --version - Displays the version number and copyright information.

b. --help - Displays a description of the command-line options.

4.4 filter.pl
-------------
The type based gold data that we prepare contains phrasal
translations. We make this gold data available, but as K-vec does not
find phrasal translations, we filter them out of the data.

Usage: 
filter.pl [{INPUT OUTPUT1 OUTPUT2 | --help | --version}]
Here,
INPUT is the file which contains the type based gold data.
OUTPUT1 is the name of the file that will contain the type based gold
data with no phrasal translations.
OUTPUT2 is the name of the file that will contain the type based gold
data with only the phrasal translations.


"filter.pl" has the following command-line options
a. --version - Displays the version number and copyright information.

b. --help - Displays a description of the command-line options.
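
The separation performed by filter.pl can be sketched as follows. This
is not the code of filter.pl. The sub name split_phrasal is
hypothetical, and the rule that an entry is phrasal when either side
has more than one word is an assumption.

```perl
use strict;
use warnings;

# Split type based entries into single-word and phrasal translations.
# Returns two array refs: (single-word entries, phrasal entries).
sub split_phrasal {
    my (@lines) = @_;
    my (@single, @phrasal);
    for my $line (@lines) {
        my ($src, $tgt) = split /->/, $line, 2;
        next unless defined $tgt;    # skip lines without "->"
        my @src_words = split ' ', $src;
        my @tgt_words = split ' ', $tgt;
        if (@src_words > 1 || @tgt_words > 1) {
            push @phrasal, $line;
        }
        else {
            push @single, $line;
        }
    }
    return (\@single, \@phrasal);
}

# A few of the type based entries from the example in section 4.3.
my ($single, $phrasal) = split_phrasal(
    'the ->  les', 'would  have ->  auraient',
    'us ->  nous', 'us ->  notre  ame',
);
print "single:  $_\n" for @$single;
print "phrasal: $_\n" for @$phrasal;
```

Here the first list corresponds to OUTPUT1 and the second to OUTPUT2.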


4.5 script.pl
-------------

Instead of creating the gold standard data by running the individual
programs, "script.pl" (a simple Perl script) provides an automatic
means of doing the same. The usage of script.pl is the same as that of
consensus.pl.

Usage
script.pl [{[--content] BLINKER_DIR OUT_DIR ANNOTATOR_LIST |--help|--version}]
Here,
BLINKER_DIR - the path of the directory where the Blinker data is
installed and where the directories for all the annotators are
present. The path should be given within double quotes.

OUT_DIR - the directory where the consensus files are stored, i.e.
the output files of consensus.pl for each verse.

ANNOTATOR_LIST - the group of annotators among which we find a
consensus. For example, to find the consensus amongst annotators 1, 2
and 3, give the annotator numbers separated by " " as input.

script.pl has the following command-line options

--content - if this option is used, consensus.pl uses the "*.open"
files as input, i.e. the files containing only the annotations for the
content words. The gold data created will then cover only the content
words.

--help - prints the help message

--version - prints the version and copyright information.




5. System Requirements 
----------------------

1. Perl (v5.6.0).
This package is written in Perl. Perl is freely available from
http://www.perl.org

2. Blinker data (I. Dan Melamed). It is freely available from
http://www.cs.nyu.edu/~melamed/datasets.html


6. Copying 
------------ 

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your option) 
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with 
this program; if not, write to the Free Software Foundation, Inc., 59 Temple 
Place - Suite 330, Boston, MA  02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file 
GPL.txt that you should have received with this distribution.

7. Acknowledgment 
------------------

We would like to thank I. Dan Melamed for providing the Blinker data
for research purposes and also for providing a detailed description of
the data.

8. References 
-------------- 
1. I. Dan Melamed. "Empirical Methods for Exploiting Parallel Texts,"
The MIT Press, Cambridge, MA, and London, England.

2. I. Dan Melamed. (1998) "Manual Annotation of Translational
Equivalence: The Blinker Project," Institute for Research in Cognitive
Science Technical Report #98-07. University of Pennsylvania,
Philadelphia, PA.
