Conversion of Senseval-1 data to Senseval-2 Format: A Readme
                  DRY RUN DATA VERSION
============================================================

            Satanjeev Banerjee and Ted Pedersen
              University of Minnesota, Duluth
              {bane0025, tpederse}@d.umn.edu

                    9th September, 2001

Introduction:
-------------

Prior to the regular Senseval 1 competition, a preliminary corpus of
sense-tagged text, known as the DRY RUN data, was made available to
participants. This data is in a somewhat different format from the data
used in the actual Senseval 1 competition, so the script
sensevalOne2Two.pl does not work on it. The script dryrun2sval2.pl is
provided to convert the DRY RUN data into the Senseval 2 format. See [2]
for further details on the data format. This script is written in Perl
and is freely available under the GNU General Public License.

Note that the Senseval 1 DRY RUN data is only available by request from
the Senseval coordinators. See [1] for contact information. Once you
obtain and unpack this data, you will likely find it in a directory
called CORPUS that contains 38 .cor files that have been manually tagged
with HECTOR senses.

Usage:
------

dryrun2sval2.pl

This script takes the training files in the Senseval 1 dry run format
and converts them into a single XML file in Senseval 2 format.

> Running the script with no switches gives a brief usage note. It also
  lets you know that further help can be obtained by running the script
  with the --help switch.

> There is one mandatory switch:

  a> --dir specifies the directory in which the dry run data resides.

> There are two optional switches:

  a> --clean_up removes all capitalization, converts multiple white
     space to single white space, and removes new lines.

  b> --numeric_answers_only Some answers are accompanied by hyphenated
     notes from the human taggers. This switch removes those notes and
     retains only the numerical portion of the answers.
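
The effect of the two optional switches described above can be
illustrated with a small sketch. It is written in Python, not Perl, and
is not taken from dryrun2sval2.pl: the function names clean_up and
numeric_answer_only and the sample tag "512345-x" are made up for
illustration. The sketch also assumes that new lines are replaced by a
single space rather than deleted outright, which the description above
does not specify.

```python
import re


def clean_up(text: str) -> str:
    """Approximate --clean_up: lowercase the text and collapse every run
    of whitespace (spaces, tabs, new lines) into a single space.
    Assumption: new lines become a space rather than being deleted."""
    return " ".join(text.lower().split())


def numeric_answer_only(answer: str) -> str:
    """Approximate --numeric_answers_only: keep the leading digits of an
    answer and drop any hyphenated tagger note that follows them.
    An answer with no leading digits is returned unchanged."""
    match = re.match(r"\d+", answer)
    return match.group(0) if match else answer


if __name__ == "__main__":
    # Hypothetical inputs, not drawn from the DRY RUN corpus.
    print(clean_up("The  Quick\nBrown   Fox"))
    print(numeric_answer_only("512345-x"))
```

The actual script may differ in its details, for example in how it
treats leading and trailing white space or answers that carry more than
one note.
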

> To obtain the data we provide, we ran the script as follows:

  dryrun2sval2.pl --dir CORPUS > dry-run.xml

Bug Reports:
------------

We have spot-tested the data and tested the conversion scripts, and have
fixed all the bugs we found. However, since there is always One More
Bug, we ask you to keep your eyes open for them and to let us know if
you find discrepancies. Bug reports, comments and criticism can be sent
to {tpederse, bane0025}@d.umn.edu.

References:
-----------

[1] Kilgarriff, Adam. "English SENSEVAL Resources in the Public Domain."
    World Wide Web document, url:
    http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html

[2] "Senseval 2: Second International Workshop on Evaluating Word Sense
    Disambiguation Systems." World Wide Web document, url:
    http://www.sle.sharp.co.uk/senseval2/

Copying:
--------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

Acknowledgment:
---------------

This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
of Research, Artistry and Scholarship from the Office of the Vice
President for Research and the Dean of the Graduate School of the
University of Minnesota.