Conversion of Senseval-1 data to Senseval-2 Format: A Readme
============================================================

Version 0.31

Satanjeev Banerjee and Ted Pedersen
University of Minnesota, Duluth
{bane0025, tpederse}@d.umn.edu

7th September, 2001


1. Introduction:
----------------

We have converted the English language material available as a part of
the 1998 Senseval Word Sense Disambiguation evaluation exercise to the
format used for tasks in the Senseval 2 Word Sense Disambiguation
exercise.

The English language resources of the 1998 Senseval exercise and an
accompanying discussion by Adam Kilgarriff have been placed in the
public domain. These can be obtained from [1]. Information on
Senseval 2 can be obtained from [2].

The directory Dry-Run-Data contains a script that converts the
Senseval 1 dry run data into Senseval 2 format. It also includes a
converted copy of this data.


2. The Format of the English Language Material Used in the 1998 Senseval:
-------------------------------------------------------------------------

The English language material used in the 1998 Senseval word sense
disambiguation exercise consists of the following 4 components:

2.1> The training data: This is available in tar and gzipped format
from the following url:
http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/train.tar.gz.
Upon unzipping and untarring, one obtains a directory, TRAIN,
consisting of 29 word files. Each word file contains blank line
separated instances of the word. Each instance has a number (the
instance id) and a tag that marks the head word.

2.2> The test data: This is available in tar and gzipped format from
the following url:
http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/test.tar.gz.
Upon unzipping and untarring, one obtains a directory, TEST,
consisting of 41 evaluation tasks. Except for 5 tasks, each task
consists of a single word used in a single part of speech.
Thus there are 15 words used as nouns (filename format: word-n.eval),
13 as verbs (filename format: word-v.eval), 8 as adjectives (filename
format: word-a.eval) and 5 words whose part of speech is indeterminate
(filename format: word-p.eval). Each word file contains blank line
separated instances of the word. Each instance has a number (the
instance id) and a tag that marks the head word.

The original Senseval-1 data mentioned in 2.1 and 2.2 above contains
some instances with erroneous head word tags. These errors have been
corrected, and the corrected data is available for download as the
Senseval1-fix package at:

    http://www.d.umn.edu/~tpederse/research.html

Either the corrected Senseval-1 data or the original data may be used
as input to this package.

2.3> The gold standard taggings: This is available in tar and gzipped
format from the following url:
http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/gold.tar.gz.
Upon unzipping and untarring, one obtains a directory, GOLD,
consisting of 41 files, corresponding to the 41 evaluation tasks in
the TEST directory (obtained above). Each file contains the "answers"
to each instance id of the corresponding test file. Thus for each
instance id in file TEST/word-x.eval there is an instance id in file
GOLD/word-x, followed by a mnemonic that identifies the sense of the
word.

2.4> Mnemonic to numeric sense identifier map: This is available at
the following location:
http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/mne-uid.map.
This file contains mappings of senses from the mnemonic (used in the
test files) to the numeric-identifier form (used in the training
files).


3. The Format of the English Language Material Used in Senseval 2:
------------------------------------------------------------------

The English language material used in the "english-lex-sample" task of
Senseval 2 consists of three main components:

3.1> The training data: This is one xml file containing all the words
(filename eng-lex-sample.training.xml if downloaded from
http://www.cis.upenn.edu/~cotton/senseval/corpora.tgz). XML format is
used to encode the data. Each word is encapsulated within
<lexelt> </lexelt> tags. Within each word there are a number of
instances, each instance encapsulated within <instance> </instance>
tags. For each instance, answers are given in <answer> tags. Every
word has its part of speech identified in the lexelt tag.

3.2> The test data: This is one xml file containing all the words
(filename eng-lex-samp.evaluation.xml if downloaded from
http://www.cis.upenn.edu/~cotton/senseval/corpora.tgz). The test data
is encoded in the same XML format as the training file above. The only
difference between the two formats is the absence of <answer> tags in
the test instances.

Important Note: There is a one to one correspondence between test
files and training files. For each test file there is exactly one
train file.

3.3> The key file: The key file is one big file containing all the
instance ids in the test file and their answer senses. These senses
are exactly the ones used in the training file (unlike in Senseval 1,
where we have numeric answers in the training files and mnemonics in
the gold files).


4. The Format of the Data as Made Available by Us:
--------------------------------------------------

We have made available the English language material of the 1998
Senseval exercise in the format used for Senseval 2.
Our version of the data consists of three files:

4.1> The train file: This file (train.xml) contains all the training
instances in the files in the TRAIN directory (as described in section
2.1 above) in the format described in section 3.1 above.

We have tried to follow, as far as possible, the philosophy of
Senseval 2 in providing one training lexical element for each test
lexical element. Since the original TRAIN and TEST files do not have a
one to one correspondence, an effort has been made to split the
training files into separate lexical elements according to the part of
speech of each instance of the word. Thus, given the existence of a
file word-x.eval in the TEST directory, the instances of the
corresponding word.cor file are divided into lexelts word-n, word-v,
word-a, etc. The only exceptions are files of the form word-p.eval:
for such files the corresponding train file, word.cor, is converted
directly into lexelt word-p, and is not split into separate lexelts.
If a word.cor file is split into several word-x lexical elements,
instances tagged with sense U, T or P are made available in each such
lexical element (since their part of speech is unknown).

However, providing an exact one-to-one correspondence between test and
training data was not possible. Following is a detailed description of
the data.

a> Of the 41 word-x.eval files, the following 5 files do not have any
corresponding training data:

    deaf-a.eval
    disability-n.eval
    hurdle-p.eval
    rabbit-n.eval
    steering-n.eval

b> Of the remaining 36 files, the following 4 files have indeterminate
parts of speech:

    band-p.eval
    bitter-p.eval
    sanction-p.eval
    shake-p.eval

Hence the corresponding train files have not been split up according
to part of speech.

c> For the remaining 32 files, the corresponding training files are
split up to obtain the part-of-speech lexical elements. As a result of
the splits, a few extra lexical elements are created.
These lexical elements have been retained in train.xml, but do not
have corresponding lexical elements in test.xml. They are:

    amaze-a      bother-n     brilliant-n  calculate-a  consume-a
    excess-a     invade-a     knee-a       knee-v       promise-a
    seize-a      shirt-a      slight-n

d> Thus, for the following 36 lexical elements, we have exactly one
lexical element in train.xml and one lexical element in test.xml:

    accident-n   amaze-v      band-p       behaviour-n  bet-n
    bet-v        bitter-p     bother-v     brilliant-a  bury-v
    calculate-v  consume-v    derive-v     excess-n     float-n
    float-v      floating-a   generous-a   giant-a      giant-n
    invade-v     knee-n       modest-a     onion-n      promise-n
    promise-v    sack-n       sack-v       sanction-p   scrap-n
    scrap-v      seize-v      shake-p      shirt-n      slight-a
    wooden-a

Note: For a way to keep just these 36 lexical elements in the train
and test xml files, see the --remove_unmatched switch in section
"5.3> Script agglomerate.pl".

Note: A train file corresponding to the fixed Senseval-1 data is also
provided with this package: train-fix.xml. The fixes consist of
correctly identifying numerous head words that are not properly
annotated in the original version.

4.2> The test file: This file (test.xml) contains all the test
instances in the files in the TEST directory (as described in section
2.2 above) in the format described in section 3.2 above. As mentioned
in section 4.1 above, five words have no training file. Nevertheless,
all 41 files have been converted to lexical elements. As described in
section 3.2, the only difference between the format of the test file
and that of the training file is that there are no <answer> tags in
the test file.

Note: A test file corresponding to the fixed Senseval-1 data is also
provided with this package: test-fix.xml. The fixes consist of
correctly identifying numerous head words that are not properly
annotated in the original version.

4.3> The key file: This file (key) contains the answer for each
instance (with exceptions: see below) in the test file (test.xml) in
the format described in section 3.3 above.
These answers are taken from the GOLD files described in section 2.3
above.

Important Note: Four instances do not have any answer. These are:

    a> bet-n.700002
    b> bet-v.700380
    c> consume-v.700002
    d> bitter-p.700027


5. A Description of Our Conversion Scripts, and How To Customize the Conversion:
--------------------------------------------------------------------------------

We have provided scripts that can be used to convert the data from
Senseval 1 to Senseval 2 format in a way that suits your needs. These
scripts are written in perl and are freely available under the GNU
General Public License.

To run the scripts you need to first download the Senseval 1 data.
These can be obtained from [1]. Once you download and unzip the data,
you should have the following:

    a> a directory called TRAIN with the training .cor files.
    b> a directory called TEST with the .eval files.
    c> a directory called GOLD with the gold standard taggings.
    d> a sensemap. The name of the file available at the Senseval site
       is mne-uid.map.

Once you have the above resources ready, you can use the scripts. We
have four scripts. Following is a description of the various switches
and how you can use them to tailor the conversion to your needs.

5.1> Script Sval1to2.pl:
------------------------

This script takes the training and test files in Senseval 1 format and
converts them into individual training and test xml files in
Senseval 2 format.

The instance ids and sense ids need not be numeric, as they are in the
Senseval-1 data. Instance ids are identified by their presence
immediately after a blank line or on the first line of the file. Thus,
any data in the Senseval-1 data format may be converted to the
Senseval-2 data format using this script.

Note that part of speech information in the form of `n', `v', `a' and
`p' on the first line of the source training and test files is
ignored. Certain files in the Senseval-1 data have such information.
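
The instance-boundary rule stated above (an instance id is the first
line of the file, or the first non-blank line after a blank line) can
be illustrated with a short sketch. This is written in Python purely
for illustration and is not the code of Sval1to2.pl; the function name
split_instances and the sample text are hypothetical.

```python
def split_instances(text):
    """Split blank-line separated text into (instance_id, body_lines) pairs.

    Illustrative assumption: the instance id is the whole first line of
    each block. Ids are kept as strings, so they need not be numeric.
    """
    instances = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None              # a blank line ends the current instance
            continue
        if current is None:
            # First line of the file, or first line after a blank line.
            current = (line.strip(), [])
            instances.append(current)
        else:
            current[1].append(line)
    return instances


# Made-up sample data with one numeric and one non-numeric id.
sample = "800001\nfirst instance text\n\nabc-17\nsecond instance\nmore text\n"
for instance_id, body in split_instances(sample):
    print(instance_id, len(body))
```

Keeping the ids as plain strings is what lets non-numeric ids pass
through unchanged in this sketch.
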
5.1.1> Running the script with no switches gives a brief usage note.
It also lets you know that further help can be obtained by running the
script with the --help switch.

5.1.2> There are three compulsory switches:

    a> --training: to specify the directory containing the training
       (.cor) files.
    b> --test: to specify the directory containing the test (.eval)
       files.
    c> --sensemap: to specify the sensemap.

5.1.3> There are a number of optional switches:

    a> --split_on_POS: splits the training files as described in
       section 4.1 above. The part of speech of each instance is
       detected from the sensemap. By default splitting is not done,
       in which case each training file results in exactly one xml
       file.
    b> --clean_up: removes all capitalization, converts multiple white
       space to single white space, and removes new lines.
    c> --numeric_answers_only: some answers are accompanied by
       hyphenated notes from the human taggers. This switch removes
       the notes and retains only the numerical portion of the
       answers.
    d> --floating: creates floating-a.train.xml instead of
       float-a.train.xml. By default float-a.train.xml would have been
       created, but since the test file is floating-a.eval, our
       experience is that having the training and test lexelt tags
       exactly match makes life a lot easier.

5.1.4> To obtain the data we provide, we ran the script thusly:

    Sval1to2.pl --training TRAIN --test TEST --sensemap mne-uid.map --split_on_POS --floating --numeric_answers_only

5.2> Script keyOne2Two.pl:
--------------------------

This script converts GOLD files into a key file, as described in
section 4.3.

5.2.1> It has one switch: --sensemap. This allows you to provide it
with the sensemap file (mne-uid.map). If this file is provided, the
answer mnemonics are converted to the corresponding numerical answers.
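
The effect of the --sensemap switch described above can be sketched as
a simple lookup. This is an illustration in Python, not the code of
keyOne2Two.pl. It assumes, without confirmation from the actual
mne-uid.map file, that each map line holds a mnemonic followed by its
numeric identifier, separated by white space. The function names and
the sample mnemonics are hypothetical, and leaving unmapped mnemonics
unchanged is a choice made for this sketch only.

```python
def load_sensemap(lines):
    """Build a mnemonic -> numeric id dictionary.

    Assumed layout (not verified against mne-uid.map): two white-space
    separated fields per line, mnemonic first, numeric identifier second.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def convert_answer(mnemonic, sensemap=None):
    """Return the numeric id if a sensemap is given, else the mnemonic."""
    if sensemap is None:
        return mnemonic                      # no --sensemap: keep the mnemonic
    return sensemap.get(mnemonic, mnemonic)  # unmapped mnemonics are left as-is


# Made-up map entries, for illustration only.
sensemap = load_sensemap(["sense_a 100001", "sense_b 100002"])
print(convert_answer("sense_a", sensemap))
print(convert_answer("sense_a"))
```
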
5.2.2> We ran this script thusly:

    keyOne2Two.pl --sensemap mne-uid.map GOLD > key

5.3> Script agglomerate.pl:
---------------------------

This script agglomerates the xml files output by Sval1to2.pl into one
big train file and one big test file as described in sections 4.1 and
4.2 above. All file names should be in the format *.train.xml or
*.test.xml.

5.3.1> It has two compulsory switches: --training FILE specifies the
name of the output train file and --test FILE specifies the name of
the output test file.

Other switches:

5.3.2> Switch: --remove_unmatched. This removes all training files
which do not have a matching test file and all test files which do not
have a matching training file. By default nothing is deleted, and
everything goes into the final xml files. To see which lexical
elements are affected, look at section 4.1 above.

5.3.3> Switch: --dir DIR. Specifies the directory in which to look for
the train and test files. By default, the current directory is
searched.

5.3.4> We ran this script thusly:

    agglomerate.pl --train train.xml.temp --test test.xml.temp --dir directory

(where "directory" was the directory containing all the xml files
created by Sval1to2.pl).

5.4> Script getUnique.pl:
-------------------------

This script takes a file in the SENSEVAL-2 format and replaces
multiple occurrences of the same instance (having the same instance
id, but different answer tags) with a single instance having multiple
answer tags. We require this script because Sval1to2.pl above creates
separate copies of instances that have multiple answer tags, such that
each copy has a single answer tag.

While this script is provided with this suite, it may be used
separately too. It takes as input a SENSEVAL-2 file and writes another
one, with the above modifications, to stdout.

We ran this script thusly:

    getUnique.pl train.xml.temp > train.xml
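
The merging step that getUnique.pl performs can be sketched on a
simplified representation. This is an illustration in Python and not
the script's actual code: it assumes the instances have already been
read into (instance id, sense id) pairs, leaving XML parsing aside.
The function name and the sense ids s1, s2 and s3 are hypothetical.

```python
def merge_duplicates(instances):
    """Collapse repeated instance ids into one entry with all their senses.

    `instances` is an iterable of (instance_id, sense_id) pairs, one pair
    per single-answer copy of an instance. The first-seen order of
    instance ids is preserved.
    """
    merged = {}
    for instance_id, sense_id in instances:
        senses = merged.setdefault(instance_id, [])
        if sense_id not in senses:
            senses.append(sense_id)
    return list(merged.items())


# Two single-answer copies of one instance, plus an unrelated instance.
copies = [
    ("accident.800038", "s1"),
    ("accident.800038", "s2"),
    ("accident.800039", "s3"),
]
for instance_id, senses in merge_duplicates(copies):
    print(instance_id, senses)
```
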
Note that since the test files do not have answer tags, Sval1to2.pl
does not create multiple copies of instances, and so this program does
not need to be run on the test.xml file.


6. Multiple Correct Answers in Senseval-1 Data
----------------------------------------------

There are 137 training instances that have 2 correct answers. We
represent this in our Senseval-2 version of the training data
(train.xml) by creating duplicate instances that differ only with
respect to their answer instance tag. See instance id=accident.800038
for an example.

The total number of instances in the original Senseval-1 training data
is 13,127. However, since there are 137 instances with 2 correct
answers, our Senseval-2 version of this data contains 13,264 instances
(13,127 unique instances + 137 duplicates = 13,264 instances). There
are no instances that have more than 2 correct answers.

There are 8,452 instances in the original test data and in our
Senseval-2 version.


7. Bug Reports:
---------------

We have spot-tested the data and tested the conversion scripts, and
have fixed all bugs we found. However, since there is always One More
Bug, we request you to keep your eyes open for such, and to let us
know if you find discrepancies. Bug reports, comments and criticism
can be sent to {tpederse, bane0025}@d.umn.edu.


8. References:
--------------

[1] Kilgarriff, Adam. "English SENSEVAL Resources in the Public
    Domain." World Wide Web document, url:
    http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html

[2] "Senseval 2: Second International Workshop on Evaluating Word
    Sense Disambiguation Systems." World Wide Web document, url:
    http://www.sle.sharp.co.uk/senseval2/


9. Copying:
-----------

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.


10. Acknowledgments:
--------------------

This work has been partially supported by a National Science
Foundation Faculty Early CAREER Development award (#0092784) and by a
Grant-in-Aid of Research, Artistry and Scholarship from the Office of
the Vice President for Research and the Dean of the Graduate School of
the University of Minnesota.

Date of Last Update: 1/4/03 by TDP