Duluth-Shell : C Shell scripts to replicate Duluth systems from Senseval-2

Copyright (C) 2002-2003 Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

February 2002 (v0.1)

http://www.d.umn.edu/~tpederse/code.html

These scripts are provided to replicate the Duluth systems that
participated in the Senseval-2 comparative exercise among word sense
disambiguation systems.

======================================================================

Preliminaries:
--------------

Before you can run these scripts that replicate the Duluth word sense
disambiguation systems, you must have the following installed on your
machine. All of this code is freely available:

   Weka (v3.2.1)       http://www.cs.waikato.ac.nz/~ml/weka/index.html
   BSP (v0.4)          http://www.d.umn.edu/~tpederse/code.html
   SenseTools (v0.1)   http://www.d.umn.edu/~tpederse/code.html

Make sure that the directories where this code resides are included in
your $PATH. You will also need to include your Weka and SenseTools
directories in $CLASSPATH, since both use Java code.

BSP v0.4 is not the most current version of BSP, but it is mentioned
here because it was the version used during Senseval-2 in
spring/summer 2001. More current versions will work as well.

Here is how I set up my $PATH and other environment variables in my
.cshrc file:

setenv WEKAHOME $HOME/weka-3-2-1
setenv CLASSPATH $HOME/weka-3-2-1/weka.jar:$HOME/SenseTools-0.1:

set path = ($HOME/Duluth-Shell/Scripts $path)   # senseval scripts
set path = ($HOME/Duluth-Shell/Perl $path)      # senseval perl code
set path = ($HOME/bsp-v0.4 $path)               # bsp
set path = ($HOME/SenseTools-0.1 $path)         # sensetools
set path = ($HOME/Python-2.1 $path)             # python

(These paths will of course vary depending on where you decide to
install these various packages. I have just put them all in my $HOME
directory.)

Also note that all Perl programs (filenames end in .pl) and the Python
program (scorer.python) assume that Perl and Python reside in
/usr/local/bin/.
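For example, one way to repoint a program at a differently-located
interpreter is to rewrite its first line with sed. This is only a
sketch: example.pl and /usr/bin/perl are illustrative; substitute your
own script name and the output of 'which perl' on your system.

```shell
# Create a stand-in script whose first line names /usr/local/bin/perl.
printf '#!/usr/local/bin/perl\nprint "ok\\n";\n' > example.pl

# Rewrite only the first (interpreter) line to the actual Perl location.
sed '1s|/usr/local/bin/perl|/usr/bin/perl|' example.pl > example.pl.new
mv example.pl.new example.pl

head -1 example.pl    # -> #!/usr/bin/perl
```

Alternatively, a symbolic link from the actual location (for example,
ln -s /usr/bin/perl /usr/local/bin/perl, run with root privileges)
avoids editing each program.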
If this is not the case, you will need to edit the first line of these
programs where this path is specified, or create a link to this
directory from the actual location.

Weka is written in Java, BSP and SenseTools are written in Perl, and
the Senseval scoring program is written in Python. There is also a
Java program used in SenseTools. Thus, you'll need to have the
following programming languages available on your system. If you
don't, they are all freely available and seem to be easy to install:

   Java (1.3.1)     http://www.java.sun.com
   Perl (5.6.1)     http://www.perl.org
   Python (2.1)     http://www.python.org

Keep in mind that these scripts and all these instructions are written
from a Unix/Linux perspective. If you are on a different platform, no
doubt there are some special things that you will need to do that are
not discussed here. I have not tested these systems outside of
Linux/Unix and am not sure what would be involved in making them run
on other platforms.

======================================================================

Sense-Tagged Data Provided with this Distribution:
--------------------------------------------------

There are four directories of data included with this distribution,
found in the /Data directory. Please note that all of them are
re-distributions of original Senseval data. If you have any doubts
about the completeness, quality, etc. of this data, or would like to
carry out different kinds of preprocessing, please get the original
data from the Senseval web sites. This data is provided for your
convenience, and is meant to make it easy to reproduce the results of
the Duluth systems that participated in Senseval-2.

The four data directories are:

Data-English-1:

This is the English lexical sample data for Senseval-1, formatted
using the Senseval-2 format. The scripts in this directory are all set
up for Senseval-2 words; however, there are statements commented out
in all scripts that will allow you to process the Senseval-1 data.
Simply uncomment the foreach statement that follows the comment
"senseval-1 data" and comment out the foreach statement that follows
the comment "senseval-2 data".

This data was converted from Senseval-1 to Senseval-2 format using the
SensevalOne2Two tool, and is also available from
http://www.d.umn.edu/~tpederse/data.html

[For testing purposes a smaller subset of this data is provided in
Test-English-1.]

Data-English-2:

Includes the English lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. SenseTools includes
our preprocessing tools, so you are free to come up with your own
preprocessing procedure. This directory also includes the stoplist and
token definition files that we used. The program we used to create the
stoplist is included with SenseTools (stop.pl), so you are free to
create your own. The token definition file is created manually and is
based on Perl regular expressions.

This data is available from the Senseval-2 web site and is known as
english-lex-sample.

[For testing purposes a smaller subset of this data is provided in
Test-English-2.]

Data-English-G-2:

Includes the English lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. This is the same
data as in Data-English-2, except that the sense distinctions are
based on groups rather than senses. This was created as part of a
special exercise that occurred following Senseval-2. The group results
are not reported on the official web site.

This data is available from the Senseval-2 web site and is known as
english-lex-group-sample.

[For testing purposes a smaller subset of this data is provided in
Test-English-G-2.]

Data-Spanish-2:

Includes the Spanish lexical sample data as preprocessed by the Duluth
word sense disambiguation systems for Senseval-2. SenseTools includes
our preprocessing tools, so you are free to come up with your own
preprocessing procedure.
This directory also includes the stoplist and token definition files
that we used. The program we used to create the stoplist is included
with SenseTools (stop.pl), so you are free to create your own. The
token definition file is created manually and is based on Perl regular
expressions.

[For testing purposes a smaller subset of this data is provided in
Test-Spanish-2.]

======================================================================

Description of C shell Scripts:
-------------------------------

The scripts needed to run the Duluth systems are located in the
/Scripts directory, along with the data directories mentioned above.
You can run a Duluth system simply by executing the script of the
appropriate name. Make sure that the /Scripts directory is in your
$PATH, since you will probably want to run these scripts from the
directory created by the setup script and not from where the scripts
reside.

------------------------------------------------------------------

The following scripts will start the like-named Duluth system:

For English:

   duluth1 duluth2 duluth3 duluth4 duluth5
   duluthA duluthB duluthC

For Spanish:

   duluth6 duluth7 duluth8 duluth9 duluth10
   duluthX duluthY duluthZ

------------------------------------------------------------------

The script 'zeror' will run a simple majority classifier. This
determines which sense occurs most frequently in the training data and
assigns that sense to every instance of the word in the test data. The
script 'zerof' is used to identify features for this classifier; the
only feature is the sense of the words in the training data.

------------------------------------------------------------------

The following scripts identify unigram, bigram, and co-occurrence
features in the data. These scripts could be replaced by utilizing new
options in BSP v0.5, although we have not done that here, in order to
preserve the implementation of the Duluth Senseval-2 systems.
For English:              gram1 gram2
For Spanish:              gram1-spa gram2-spa
For English and Spanish:  coll2

------------------------------------------------------------------

The following scripts identify various combinations of bigram features
using BSP v0.4. Each script has fairly detailed comments describing
the particular criteria used to identify features.

   f0 f1-mod f1f3 f2 f3 g2

------------------------------------------------------------------

The following scripts are used to set up data for processing.

   preprocess setup

------------------------------------------------------------------

The script 'xml2arff' calls xml2arff.pl, which converts sense-tagged
text into a feature vector representation.

------------------------------------------------------------------

The following scripts are used to call Weka.

   bag wekarun

------------------------------------------------------------------

The following scripts are used to create ensembles and final answer
files suitable for scoring by scorer.python.

   score score-word

------------------------------------------------------------------

The following Python program was used for scoring in Senseval-2.

   scorer.python

======================================================================

Misc Perl Scripts in Support of C Shell Scripts:
------------------------------------------------

../Perl: Contains a few Perl utility programs used to deal with
unigram and co-occurrence features. You should have this directory in
your $path too.

Note that some of these programs were quick fixes to overcome
limitations of an earlier version of the Bigram Statistics Package,
and that many of them could be eliminated with the use of BSP v0.5.
However, since the actual Senseval-2 systems were run using BSP v0.4,
I have retained this older code. Please note that you can still use
BSP v0.5 alongside these older programs.
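To make the majority-classifier idea behind 'zeror' concrete, here is
a small shell illustration. This is not the actual script; the file
name and sense labels are made up for the example.

```shell
# Made-up training labels, one sense tag per instance.
printf 'art%%1:06:00\nart%%1:09:00\nart%%1:06:00\n' > train.senses

# Count each sense, sort by frequency, keep the most frequent one;
# that single sense is then the answer for every test instance.
majority=$(sort train.senses | uniq -c | sort -rn | awk 'NR==1 {print $2}')
echo "$majority"    # -> art%1:06:00
```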
======================================================================

Setting up a LexSample Directory (optional):
--------------------------------------------

The data in LexSample has been preprocessed and is ready to be used as
input by the Duluth systems. However, if you have sense-tagged text
that is in the same format as the Senseval-2 data, you can create your
own version of LexSample data from scratch. You might also want to do
different kinds of preprocessing than we did to the Senseval-2 and
Senseval-1 data.

The steps I followed to create the data in LexSample are as follows.
You should follow the same process, except that you will need to use
your own file names and possibly vary the preprocessing steps
somewhat.

First, I put the training and test/evaluation lexical sample files
into separate directories, Train and Test. The training file is called
lexical-sample.training.xml and the test/evaluation file is called
lexical-sample.test.xml. [Please note that these scripts are written
such that your training and test files must be named *training.xml and
*test.xml.]

mkdir Train
cp lexical-sample.training.xml Train
cd Train
preprocess lexical-sample.training
cd ..

mkdir Test
cp lexical-sample.test.xml Test
cd Test
preprocess lexical-sample.test
cd ..

mkdir LexSample
cp -r Test/* LexSample
cp -r Train/* LexSample

Now LexSample contains the testing and training data, where each word
has its own directory, with separate test and training files in both
xml and count format (4 files per word directory).

======================================================================

Running the Duluth systems:
---------------------------

To reproduce the Duluth systems that participated in Senseval-2:

Use the setup script to create a directory that contains a copy of the
lexical sample data, the stoplist, and the token definition file. You
can use any of the four data directories above for this.
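As an aside, the LexSample construction steps from the previous
section can be collected into a single Bourne-shell function. This is
only a sketch: it assumes 'preprocess' (from the /Scripts directory)
is on your $PATH and that the input files follow the required
*training.xml / *test.xml naming convention.

```shell
# Sketch: build a LexSample directory from a pair of Senseval-2 format
# files named <base>.training.xml and <base>.test.xml, using the same
# sequence of commands shown above.
build_lexsample() {
    base=$1                          # e.g. lexical-sample
    mkdir Train Test LexSample
    cp "$base.training.xml" Train
    ( cd Train && preprocess "$base.training" )
    cp "$base.test.xml" Test
    ( cd Test && preprocess "$base.test" )
    cp -r Test/* LexSample           # per-word test files
    cp -r Train/* LexSample          # per-word training files
}
```

Running, say, build_lexsample lexical-sample should then leave a
LexSample directory laid out as described above.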
However, to replicate the Senseval-2 results, you should run setup
with Data-English-2 and Data-Spanish-2. For example,

setup Data-English-2 Test1

This will make a copy of Data-English-2 in Test1. Note that you can
specify path names to this script if you wish, and create Test1
somewhere other than the /Data directory where Data-English-2 resides.

To run any of the Duluth systems, you *must* be in the directory
created via the setup command. Given the setup command above, you
should change to the Test1 directory:

cd Test1

Then run the desired Duluth system as follows:

duluthX LexSample stop-list token-def

where X = 1, 2, 3, etc., and LexSample is the name of the directory in
which the sense-tagged data is located. LexSample is found within the
directory you created with the setup script, and in Data-English-1,
Data-English-2, Data-English-G-2, and Data-Spanish-2. Note that you
should specify the full path names for stop-list and token-def if they
are not in the directory you are running from.

The following systems expect English language data:

   duluth1 duluth2 duluth3 duluth4 duluth5
   duluthA duluthB duluthC

These are run by submitting the scripts of the same name. Please note
that you can process the data found in the LexSample directory in
Data-English-2, Data-English-G-2, and Data-English-1 with these
scripts.

The following systems expect Spanish language data:

   duluth6 duluth7 duluth8 duluth9 duluth10
   duluthX duluthY duluthZ

These are run by submitting the scripts of the same name. Please note
that you can process the data found in the LexSample directory in
Data-Spanish-2 with these scripts.

In reality there are only small differences between the corresponding
English and Spanish systems. In general they take the same basic
approach, perhaps using a few different parameter settings at most.
The correspondence is as follows:

   duluth1 <-> duluth6
   duluth2 <-> duluth7
   duluth3 <-> duluth8
   duluth4 <-> duluth9
   duluth5 <-> duluth10
   duluthA <-> duluthX
   duluthB <-> duluthY
   duluthC <-> duluthZ

======================================================================

Related Publications:
---------------------

The following are available from:
http://www.d.umn.edu/~tpederse/pubs.html

A Baseline Methodology for Word Sense Disambiguation (Pedersen)
 - Appears in the Proceedings of the Third International Conference on
   Intelligent Text Processing and Computational Linguistics
   (CICLING-02), February 17-22, 2002, Mexico City
   [an extended version of the Senseval-2 workshop paper]

Machine Learning with Lexical Features: The Duluth Approach to
Senseval-2 (Pedersen)
 - Appears in the Proceedings of SENSEVAL-2: Second International
   Workshop on Evaluating Word Sense Disambiguation Systems,
   July 5-6, 2001, Toulouse, France
   [a short paper that describes all of the Duluth systems]

======================================================================

Copying:
--------

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.
------------------------------------------------------------------------

Originally written by Ted Pedersen, July 18, 2001

modified Sep 24, 2001 by TDP
modified Nov 25, 2001 by TDP
modified Feb 02, 2002 by TDP
modified Feb 10, 2002 by TDP

------------------------------------------------------------------------