			          Hansard's Gold Data Package   
			   ======================================
			                  Version 0.2
				    Copyright (C) 2001-2002
 
			      Ted Pedersen, tpederse@d.umn.edu 
                               Nitin Varma, varm0003@d.umn.edu 
                               University Of Minnesota, Duluth
                       


1. Introduction
---------------

This is a simple Perl program that compiles gold standard data from the
500 sentences of the Hansard's data that were manually annotated by
Franz Josef Och[1]. The Hansard's data is an English-French parallel
text consisting of the proceedings of the Canadian Parliament. It is
manually annotated, i.e. human annotators have linked the words of the
English text to their translations in the French text. Manual
annotation is a very time-consuming task, and manually annotated data
is therefore rare. We decided to use the Hansard's data because it is
free and easily available online. Most importantly, the data is
manually annotated and is therefore a reliable source from which to
compile the gold standard data, that is, a lexicon that has
translations for all the words in the text.

Many standard dictionaries are available for the English-French
language pair. These dictionaries are not comprehensive, however, and
may not have translations for all the words of the parallel text. A
word may also have different translations depending on the context in
which it is used. Standard dictionaries may lack some of these
translations, or may list translations that are never used in the
parallel text. The gold standard lexicon is compiled from the parallel
text itself, and thus has translations for all the words in it, in the
contexts in which they are used.

This README describes in detail the approach we took to compile the
gold standard bilingual lexicon for the 500 sentences of the Hansard's
data provided by Franz Och. Users need to obtain this data directly
from Franz Och.

2. Hansard's data
-----------------

The Hansard's data consists mainly of three files:

e (English file)
f (French file)
ef.alignment (file containing the manual annotations for the 500
sentences)

All three files need to be present in a directory called "Hansards-gold".

Both the "e" and the "f" files contain 49,500 sentences. The manual
alignments in the ef.alignment file cover only the first 500
sentences, so we compile gold standard data for these sentences only.


For the Hansard's data, if there are n word tokens in a sentence, then
the words are numbered from 0 to n-1. Word 0 therefore represents the
first word of the sentence. The format of the aligned data is shown
below.

Sentence:401
S  0  0
S  2  1
P  0  0
P  1  1
P  1  2 
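
The format shown above can be read with a few lines of code. The
following is a minimal sketch in Python (not the Perl code of gold.pl),
assuming that each sentence starts with a "Sentence:N" header and that
every following line holds an alignment type and two whitespace-separated
positions. The function name parse_alignments is hypothetical.

```python
from collections import defaultdict


def parse_alignments(lines):
    """Group alignment links by sentence number.

    Returns {sentence_number: [(kind, english_pos, french_pos), ...]}.
    Assumes the layout shown in this README; the real file may differ.
    """
    links = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between sentences
        if line.startswith("Sentence:"):
            current = int(line.split(":", 1)[1])
            continue
        if current is None:
            raise ValueError("alignment link before any Sentence: header")
        kind, e_pos, f_pos = line.split()
        links[current].append((kind, int(e_pos), int(f_pos)))
    return dict(links)


sample = """Sentence:401
S  0  0
S  2  1
P  0  0
P  1  1
P  1  2
""".splitlines()

parsed = parse_alignments(sample)
# Keep only the S links, dropping the alignment type.
sure_links = [(e, f) for kind, e, f in parsed[401] if kind == "S"]
print(sure_links)  # [(0, 0), (2, 1)]
```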



The S and P alignments distinguish between word-by-word and phrasal
alignments. P alignments are phrasal. K-vec does not deal with phrasal
translations, so we do not use the P alignments in K-vec++ and create
the gold standard lexicon from the S alignments only.

In the example above, for sentence number 401, the entry ``S 0 0''
tells us that the word at position 0 in the English sentence is
translated to the word at position 0 in the French sentence. Similarly,
the entries ``P 1 1'' and ``P 1 2'' tell us that the word at position 1
in the English sentence is translated to the words at positions 1 and 2
in the French sentence, which is an occurrence of a phrasal
translation.

There are 44,351 tokens in the gold standard lexicon created from the
aligned version of the Hansard's data, among which 1,528 types are
found. This is the number of `entries' in that lexicon.
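
The difference between tokens and types can be illustrated with a small
Python sketch. It assumes that a token is one occurrence of an S-linked
word pair and that a type is a distinct pair; this reading is our
assumption for illustration, and the word pairs below are invented toy
data, not taken from the Hansard's data. The function name
count_tokens_and_types is hypothetical.

```python
from collections import Counter


def count_tokens_and_types(word_pairs):
    """Return (tokens, types) for a list of (english, french) pairs.

    tokens = total number of pair occurrences
    types  = number of distinct pairs (assumed to be the lexicon entries)
    """
    counts = Counter(word_pairs)
    tokens = sum(counts.values())
    types = len(counts)
    return tokens, types


# Invented toy pairs: ("the", "le") occurs twice, so it is 2 tokens, 1 type.
pairs = [("the", "le"), ("house", "maison"), ("the", "le"), ("the", "la")]
tokens, types = count_tokens_and_types(pairs)
print(tokens, types)  # 4 3
```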


3. gold.pl
----------
This program takes the English and French files together with the
ef.alignment file (which contains, for each of the 500 sentences, the
position number of a word and of its translation) as input and outputs
word pairs in the form X -> Y, where X is a word from the English text
and Y is a word from the French text.
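
The step from word positions to word pairs can be sketched as follows.
This is a Python illustration under stated assumptions, not the code of
gold.pl: sentences are assumed to be tokenized on whitespace, positions
are 0-based as described above, and the sentence pair is an invented toy
example. The function name word_pairs is hypothetical.

```python
def word_pairs(english_sentence, french_sentence, sure_links):
    """Turn (english_pos, french_pos) S links into "X -> Y" strings."""
    e_words = english_sentence.split()  # assumption: whitespace tokens
    f_words = french_sentence.split()
    return [f"{e_words[e]} -> {f_words[f]}" for e, f in sure_links]


# Invented toy sentence pair with three S links.
lines = word_pairs(
    "the house is big",
    "la maison est grande",
    [(0, 0), (1, 1), (3, 3)],
)
for line in lines:
    print(line)
```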


Usage:
gold.pl E_FILE F_FILE EF_FILE

Here,
E_FILE - is the English file
F_FILE - is the French file
EF_FILE - is the alignment file

gold.pl has the following command-line options:

1.  --version     Prints the version number

2.  --help        Prints the help message
 


4. System Requirements 
----------------------

1. Perl (v5.6.0).
This package is written in Perl. Perl is freely available from 
http://www.perl.org


5. Copying 
------------ 

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your option) 
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with 
this program; if not, write to the Free Software Foundation, Inc., 59 Temple 
Place - Suite 330, Boston, MA  02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file 
GPL.txt that you should have received with this distribution.

6. Acknowledgment 
------------------

We would like to thank Franz Josef Och for providing the manually
annotated 500 sentences of the Hansard's data for research purposes and
for providing a detailed description of the data.

7. References
-------------
1. Franz Josef Och, Hermann Ney. "Improved Statistical Alignment
Models". Proc. of the 38th Annual Meeting of the Association for
Computational Linguistics, pp. 440-447, Hong Kong, China, October 2000.

