Duluth-Shell : C Shell scripts to replicate Duluth systems from Senseval-2

Copyright (C) 2002-2003 Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

February 2002 (v0.1)

http://www.d.umn.edu/~tpederse/code.html

These scripts are provided to replicate the Duluth systems that
participated in the Senseval-2 comparative exercise among word sense
disambiguation systems.

======================================================================

Preliminaries:
--------------

Before you can run these scripts that replicate the Duluth word sense
disambiguation systems, you must have the following installed on your
machine. All of this code is freely available:

   Weka (v3.2.1)       http://www.cs.waikato.ac.nz/~ml/weka/index.html
   BSP (v0.4)          http://www.d.umn.edu/~tpederse/code.html
   SenseTools (v0.1)   http://www.d.umn.edu/~tpederse/code.html

Make sure that the directories where this code resides are included in
your $PATH. You will also need to include your Weka and SenseTools
directories in $CLASSPATH, since both use Java code.

BSP v0.4 is not the most current version of BSP, but it is mentioned
here because it was the version used during Senseval-2 in
spring/summer 2001. More current versions will work as well.

Here is how I set up my $PATH and other environment variables in my
.cshrc file:

setenv WEKAHOME $HOME/weka-3-2-1
setenv CLASSPATH $HOME/weka-3-2-1/weka.jar:$HOME/SenseTools-0.1:

set path = ($HOME/Duluth-Shell/Scripts $path)   # senseval scripts
set path = ($HOME/Duluth-Shell/Perl $path)      # senseval perl code
set path = ($HOME/bsp-v0.4 $path)               # bsp
set path = ($HOME/SenseTools-0.1 $path)         # sensetools
set path = ($HOME/Python-2.1 $path)             # python

(These paths will of course vary depending on where you decide to
install these various packages. I have just put them all in my $HOME
directory.)

Also note that all Perl programs (filenames end in .pl) and the Python
program (scorer.python) assume that Perl and Python reside in
/usr/local/bin/.
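For example, one way to repoint a program at a differently-located
interpreter is to rewrite its first line with sed. This is only a
sketch: example.pl and /usr/bin/perl are illustrative; substitute your
own script name and the output of 'which perl' on your system.

```shell
# Create a stand-in script whose first line names /usr/local/bin/perl.
printf '#!/usr/local/bin/perl\nprint "ok\\n";\n' > example.pl

# Rewrite only the first (interpreter) line to the actual Perl location.
sed '1s|/usr/local/bin/perl|/usr/bin/perl|' example.pl > example.pl.new
mv example.pl.new example.pl

head -1 example.pl    # -> #!/usr/bin/perl
```

Alternatively, a symbolic link from the actual location (for example,
ln -s /usr/bin/perl /usr/local/bin/perl, run with root privileges)
avoids editing each program.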
If this is not the case, you will need to edit the first line of these
programs where this path is specified, or create a link to this
directory from the actual location.

Weka is written in Java, BSP and SenseTools are written in Perl, and
the Senseval scoring program is written in Python. There is also a
Java program used in SenseTools. Thus, you'll need to have the
following programming languages available on your system. If you
don't, they are all freely available and seem to be easy to install:

   Java (1.3.1)     http://www.java.sun.com
   Perl (5.6.1)     http://www.perl.org
   Python (2.1)     http://www.python.org

Keep in mind that these scripts and all these instructions are written
from a Unix/Linux perspective. If you are on a different platform, no
doubt there are some special things that you will need to do that are
not discussed here. I have not tested these systems outside of
Linux/Unix and am not sure what would be involved in making them run
on other platforms.

======================================================================

Sense-Tagged Data Provided with this Distribution:
--------------------------------------------------

There are four directories of data included with this distribution,
found in the /Data directory. Please note that all of them are
re-distributions of original Senseval data. If you have any doubts
about the completeness, quality, etc. of this data, or would like to
carry out different kinds of preprocessing, please get the original
data from the Senseval web sites. This data is provided for your
convenience, and is meant to make it easy to reproduce the results of
the Duluth systems that participated in Senseval-2.

The four data directories are:

Data-English-1:

This is the English lexical sample data for Senseval-1, formatted
using the Senseval-2 format. The scripts in this directory are all set
up for Senseval-2 words; however, there are statements commented out
in all scripts that will allow you to process the Senseval-1 data.
Simply uncomment the foreach statement that follows the comment
"senseval-1 data" and comment out the foreach statement that follows
the comment "senseval-2 data".

This data was converted from Senseval-1 to Senseval-2 format using the
SensevalOne2Two tool, and is also available from
http://www.d.umn.edu/~tpederse/data.html

[For testing purposes a smaller subset of this data is provided in
Test-English-1.]

Data-English-2:

Includes the English lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. SenseTools includes
our preprocessing tools, so you are free to come up with your own
preprocessing procedure. This directory also includes the stoplist and
token definition files that we used. The program we used to create the
stoplist is included with SenseTools (stop.pl), so you are free to
create your own. The token definition file is created manually and is
based on Perl regular expressions.

This data is available from the Senseval-2 web site and is known as
english-lex-sample.

[For testing purposes a smaller subset of this data is provided in
Test-English-2.]

Data-English-G-2:

Includes the English lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. This is the same
data as in Data-English-2, except that the sense distinctions are
based on groups rather than senses. This was created as part of a
special exercise that occurred following Senseval-2. The group results
are not reported on the official web site.

This data is available from the Senseval-2 web site and is known as
english-lex-group-sample.

[For testing purposes a smaller subset of this data is provided in
Test-English-G-2.]

Data-Spanish-2:

Includes the Spanish lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. SenseTools includes
our preprocessing tools, so you are free to come up with your own
preprocessing procedure.
This directory also includes the stoplist and token definition files
that we used. The program we used to create the stoplist is included
with SenseTools (stop.pl), so you are free to create your own. The
token definition file is created manually and is based on Perl regular
expressions.

[For testing purposes a smaller subset of this data is provided in
Test-Spanish-2.]

======================================================================

Description of C shell Scripts:
-------------------------------

The scripts needed to run the Duluth systems are located in the
/Scripts directory, along with the data directories mentioned above.
You can run a Duluth system simply by executing the script of the
appropriate name. Make sure that the /Scripts directory is in your
$PATH, since you will probably want to run these scripts from the
directory created by the setup script and not from where the scripts
reside.

------------------------------------------------------------------

The following scripts will start the like-named Duluth system:

For English:

   duluth1 duluth2 duluth3 duluth4 duluth5
   duluthA duluthB duluthC

For Spanish:

   duluth6 duluth7 duluth8 duluth9 duluth10
   duluthX duluthY duluthZ

------------------------------------------------------------------

The script 'zeror' will run a simple majority classifier. This
determines which sense occurs most frequently in the training data and
assigns that sense to every instance of the word in the test data. The
script 'zerof' is used to identify features for this classifier; the
only feature is the sense of the words in the training data.

------------------------------------------------------------------

The following scripts identify unigram, bigram, and co-occurrence
features in the data. These scripts could be replaced by utilizing new
options in BSP v0.5, although we have not done that here, in order to
preserve the implementation of the Duluth Senseval-2 systems.
For English:              gram1 gram2
For Spanish:              gram1-spa gram2-spa
For English and Spanish:  coll2

------------------------------------------------------------------

The following scripts identify various combinations of bigram features
using BSP v0.4. Each script has fairly detailed comments describing
the particular criteria used to identify features.

   f0 f1-mod f1f3 f2 f3 g2

------------------------------------------------------------------

The following scripts are used to set up data for processing.

   preprocess setup

------------------------------------------------------------------

The script 'xml2arff' calls xml2arff.pl, which converts sense-tagged
text into a feature vector representation.

------------------------------------------------------------------

The following scripts are used to call Weka.

   bag wekarun

------------------------------------------------------------------

The following scripts are used to create ensembles and final answer
files suitable for scoring by scorer.python.

   score score-word

------------------------------------------------------------------

The following Python program was used for scoring in Senseval-2.

   scorer.python

======================================================================

Misc Perl Scripts in Support of C Shell Scripts:
------------------------------------------------

../Perl: Contains a few Perl utility programs used to deal with
unigram and co-occurrence features. You should have this directory in
your $path too.

Note that some of these programs were quick fixes to overcome
limitations of an earlier version of the Bigram Statistics Package,
and that many of them could be eliminated with the use of BSP v0.5.
However, since the actual Senseval-2 systems were run using BSP v0.4,
I have retained this older code. Please note that you can still use
BSP v0.5 alongside these older programs.
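To make the majority-classifier idea behind 'zeror' concrete, here is
a small shell illustration. This is not the actual script; the file
name and sense labels are made up for the example.

```shell
# Made-up training labels, one sense tag per instance.
printf 'art%%1:06:00\nart%%1:09:00\nart%%1:06:00\n' > train.senses

# Count each sense, sort by frequency, keep the most frequent one;
# that single sense is then the answer for every test instance.
majority=$(sort train.senses | uniq -c | sort -rn | awk 'NR==1 {print $2}')
echo "$majority"    # -> art%1:06:00
```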
======================================================================

Setting up a LexSample Directory (optional):
--------------------------------------------

The data in LexSample has been preprocessed and is ready to be used as
input by the Duluth systems. However, if you have sense-tagged text
that is in the same format as the Senseval-2 data, you can create your
own version of LexSample data from scratch. You might also want to do
different kinds of preprocessing than we did to the Senseval-2 and
Senseval-1 data.

The steps I followed to create the data in LexSample are as follows.
You should follow the same process, except that you will need to use
your own file names and possibly vary the preprocessing steps
somewhat.

First, I put the training and test/evaluation lexical sample files
into separate directories, Train and Test. The training file is called
lexical-sample.training.xml and the test/evaluation file is called
lexical-sample.test.xml. [Please note that these scripts are written
such that your training and test files must be named *training.xml and
*test.xml.]

mkdir Train
cp lexical-sample.training.xml Train
cd Train
preprocess lexical-sample.training
cd ..

mkdir Test
cp lexical-sample.test.xml Test
cd Test
preprocess lexical-sample.test
cd ..

mkdir LexSample
cp -r Test/* LexSample
cp -r Train/* LexSample

Now LexSample contains the testing and training data, where each word
has its own directory, with separate test and training files in both
xml and count format (4 files per word directory).

======================================================================

Running the Duluth systems:
---------------------------

To reproduce the Duluth systems that participated in Senseval-2:

Use the setup script to create a directory that contains a copy of the
lexical sample data, the stoplist, and the token definition file. You
can use any of the four data directories above for this.
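As an aside, the LexSample construction steps from the previous
section can be collected into a single Bourne-shell function. This is
only a sketch: it assumes 'preprocess' (from the /Scripts directory)
is on your $PATH and that the input files follow the required
*training.xml / *test.xml naming convention.

```shell
# Sketch: build a LexSample directory from a pair of Senseval-2 format
# files named <base>.training.xml and <base>.test.xml, using the same
# sequence of commands shown above.
build_lexsample() {
    base=$1                          # e.g. lexical-sample
    mkdir Train Test LexSample
    cp "$base.training.xml" Train
    ( cd Train && preprocess "$base.training" )
    cp "$base.test.xml" Test
    ( cd Test && preprocess "$base.test" )
    cp -r Test/* LexSample           # per-word test files
    cp -r Train/* LexSample          # per-word training files
}
```

Running, say, build_lexsample lexical-sample should then leave a
LexSample directory laid out as described above.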
However, to replicate the Senseval-2 results, you should run setup
with Data-English-2 and Data-Spanish-2. For example,

setup Data-English-2 Test1

This will make a copy of Data-English-2 in Test1. Note that you can
specify path names to this script if you wish, and create Test1
somewhere other than the /Data directory where Data-English-2 resides.

To run any of the Duluth systems, you *must* be in the directory
created via the setup command. Given the setup command above, you
should change to the Test1 directory:

cd Test1

Then run the desired Duluth system as follows:

duluthX LexSample stop-list token-def

where X = 1, 2, 3, etc., and LexSample is the name of the directory in
which the sense-tagged data is located. LexSample is found within the
directory you created with the setup script, and in Data-English-1,
Data-English-2, Data-English-G-2, and Data-Spanish-2. Note that you
should specify the full path names for stop-list and token-def if they
are not in the directory you are running from.

The following systems expect English language data:

   duluth1 duluth2 duluth3 duluth4 duluth5
   duluthA duluthB duluthC

These are run by submitting the scripts of the same name. Please note
that you can process the data found in the LexSample directory in
Data-English-2, Data-English-G-2, and Data-English-1 with these
scripts.

The following systems expect Spanish language data:

   duluth6 duluth7 duluth8 duluth9 duluth10
   duluthX duluthY duluthZ

These are run by submitting the scripts of the same name. Please note
that you can process the data found in the LexSample directory in
Data-Spanish-2 with these scripts.

In reality there are only small differences between the corresponding
English and Spanish systems. In general they take the same basic
approach, perhaps using a few different parameter settings at most.
The correspondence is as follows:

   duluth1 <-> duluth6
   duluth2 <-> duluth7
   duluth3 <-> duluth8
   duluth4 <-> duluth9
   duluth5 <-> duluth10
   duluthA <-> duluthX
   duluthB <-> duluthY
   duluthC <-> duluthZ

======================================================================

Related Publications:
---------------------

The following are available from:
http://www.d.umn.edu/~tpederse/pubs.html

A Baseline Methodology for Word Sense Disambiguation (Pedersen)
 - Appears in the Proceedings of the Third International Conference on
   Intelligent Text Processing and Computational Linguistics
   (CICLING-02), February 17-22, 2002, Mexico City
   [an extended version of the Senseval-2 workshop paper]

Machine Learning with Lexical Features: The Duluth Approach to
Senseval-2 (Pedersen)
 - Appears in the Proceedings of SENSEVAL-2: Second International
   Workshop on Evaluating Word Sense Disambiguation Systems,
   July 5-6, 2001, Toulouse, France
   [a short paper that describes all of the Duluth systems]

======================================================================

Copying:
--------

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.
------------------------------------------------------------------------

Originally written by Ted Pedersen, July 18, 2001

modified Sep 24, 2001 by TDP
modified Nov 25, 2001 by TDP
modified Feb 02, 2002 by TDP
modified Feb 10, 2002 by TDP

------------------------------------------------------------------------