SenseTools
==========

Version 0.3

Copyright (C) 2001-2003

Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

Satanjeev Banerjee, Satanjeev.Banerjee@cs.cmu.edu
Carnegie Mellon University

http://www.d.umn.edu/~tpederse/sensetools.html

1. Introduction:
----------------

SenseTools is a suite of programs that convert sense-tagged training data and untagged test/evaluation data into feature vectors suitable for use by the Weka machine learning package [3]. The training and test data are assumed to be formatted like the lexical-sample data used in the SENSEVAL-2 [4] word sense disambiguation exercise. The features used to represent the training and test data can be manually specified by the user, or automatically identified using the N-gram Statistics Package [1].

This suite consists of the following programs:

preprocess.pl: Takes an xml file in SENSEVAL-2 lexical-sample format and splits it apart into as many files as there are lexical elements in the original file. Each lexical element usually corresponds to a word used in a particular part of speech. It also performs various other preprocessing tasks on the data.

nsp2regex.pl: Takes n-word sequences and represents them as regular expressions. These are then used by xml2arff.pl to identify lexical features in the training and test data, and to convert lexical element files from text into feature vectors.

xml2arff.pl: Takes a lexical element file (created by preprocess.pl above) and a set of regular expressions (created by nsp2regex.pl above) and converts the lexical element file into a feature vector representation, where each instance in the lexical element file becomes a feature vector. The output of xml2arff.pl takes the form of an .arff file, which is a Weka-specific representation of training and test data.
WekaClassify.java: Takes an .arff file associated with the test/evaluation data for a particular word, and a machine learning model created by Weka. WekaClassify.java will classify each instance in the test file based on the specified model and will output an answer file that complies with the SENSEVAL-2 format for answer files. This allows us to use the standard SENSEVAL-2 scoring program (scorer.python) to score our answers.

tilde.pl: Replaces the '%' in the senseid's in the training examples with a '~'. This is required because Weka treats '%' as a comment character, while in SENSEVAL-2 it is included as part of the senseid's for the English lexical sample data (these senseid's are actually WordNet [5] mnemonics, the '%' being part of their standard format). If the senseid's do not include '%', as is the case with the Spanish lexical sample data, then tilde.pl is not necessary.

ensembleByDist.pl: Takes the output from a number of models created by WekaClassify.java and creates a simple ensemble by summing the probabilities or confidence measures associated with each possible sense of each instance. It produces a single file where each possible sense of each instance is assigned a score, namely the sum of the probability or confidence measures associated with that particular sense value.

winnerTakeAll.pl: Takes the output of WekaClassify.java and outputs the answer that has the highest score. It will output multiple answers in the case of ties.

xml2key.pl: Takes one or more lexical element files where the correct senseid's have already been assigned to instances and creates a key file that provides the correct sense-tag for each instance. This key is used with the scorer.python program used during SENSEVAL-2.

Programs with a .pl extension are written in Perl (version 5.6) while WekaClassify.java is written in Java.

This README continues with a brief description of the format of SENSEVAL-2 data and a very short (and contrived) example SENSEVAL-2 file.
We then describe each of the above programs in detail, explaining how each works with the help of our example SENSEVAL-2 file.

2. Format of SENSEVAL-2 Data File:
----------------------------------

The lexical sample tasks of SENSEVAL-2 are formatted using XML. Please consult the SENSEVAL website [4] for a complete description of the format. We describe below the facts about the format relevant to SenseTools.

There are two lexical sample files provided per language in SENSEVAL-2: a training file where multiple occurrences of the target words in the lexical sample have been manually sense tagged, and a test/evaluation file that contains instances of the words in the lexical sample that must be sense-tagged. The following description applies to both these files, unless otherwise mentioned.

The lexical sample files include XML tags that either provide additional information about the data, or demarcate regions of the text and give some regions special significance. XML tags that demarcate regions come in pairs, one to start the region and one to end it. Usually the tag that marks the beginning of a region looks like <tag> and is called a start tag, and the one that marks the end of the region looks like </tag> and is called an end tag. Informational XML tags look like <tag attribute="value"/>.

The first "important" tag in the file is the one that describes the corpus language. This is a tag that denotes a region, usually the whole file. Since it marks a region, it comes in a pair. The tag that starts the region is <corpus lang='english'> (where 'english' denotes the language in this file) and the tag that ends the region is </corpus>.

Within this tag pair there are one or more "lexical elements" or lexelts. Each such lexelt represents all the data available for a single word that has to be disambiguated. This region has the tag <lexelt item="word"> in the beginning (where "word" is the word to be disambiguated) and the tag </lexelt> at its end. Within each lexelt region there are several "instances".
An instance consists of a sentence containing the word to be disambiguated as well as a few sentences around the example sentence. This region is demarcated with the tag <instance id="word.XXXXX"> in the beginning and the tag </instance> in the end. Here, "word.XXXXX" is the target word followed by a few digits that uniquely identify this instance among all the instances for this word.

Within the instance region of the "training" sample file we have one or more answer tags that look like this: <answer instance="word.XXXXX" senseid="sense"/>. Here "word.XXXXX" matches the instance id in the <instance> tag while the quoted string after "senseid=" refers to a mnemonic from WordNet [5] that represents the meaning of the target word in this instance. The "test" sample file does not have answer tags since these tags are created via human sense-tagging.

Each instance has one <context> - </context> tag pair that marks the start and end of the group of sentences that make up the context of the target word. The target word itself is marked by the tags <head> and </head>.

3. An Example SENSEVAL-2 File:
------------------------------

The following is an example SENSEVAL-2 file that we will refer to later in this README as example.xml:

<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
<head>Art</head> you can dance to from the creative group called Halo.
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There's always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park.
</context>
</instance>
<instance id="art.40007">
<answer instance="art.40007" senseid="..."/>
<context>
Paintings, drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display.
</context>
</instance>
</lexelt>
<lexelt item="authority.n">
<instance id="authority.40001">
<answer instance="authority.40001" senseid="..."/>
<context>
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
</context>
</instance>
</lexelt>
</corpus>

Here we have two lexelts, "art.n" and "authority.n", where "n" denotes that these are noun senses of the words. We have three instances of art with instance id's art.40001, art.40002 and art.40007 respectively, and one instance of authority with instance id authority.40001. The first instance has two answers, while the others have one each.

3.1. Assumptions Made About the Input Xml File:
-----------------------------------------------

The input XML file should be valid xml! Thus, for example, tags that appear in pairs (like <instance> and </instance>) should be correctly paired up, and should be properly nested. Programs in this suite DO NOT check for the correctness of the input xml file, and so their behavior is undefined for "wrong" xml files. New tags not defined in the current Senseval-2 format can be included, as long as they follow the above restrictions. Multiple tags are allowed to be on the same line. However, all the characters in a single tag should be on the same line. That is, a tag (from the '<' to the '>') must not have a new-line character in it.

4. Program preprocess.pl:
-------------------------

This is the first program to use after obtaining the SENSEVAL-2 lexical sample data and before attempting to use any of the other programs in the suite. This program splits the lexical sample files up into separate files per lexical element, and also tokenizes the actual text to be processed so it can be converted to a feature vector representation more conveniently.

Recall that the lexical-sample data for a given language consists of two files: the "training" file and the "test" file. Each of these consists of multiple lexical elements, each element containing multiple instances of a word to be disambiguated. By default, preprocess.pl creates a separate xml file for each lexical element, meaning that in general there is a separate file for all the instances associated with each word. Each such xml file is in the same general format as the lexical sample file, except that it only contains instances for one word and is specific to one lexical element. Preprocess.pl also creates a "count" file for each lexical element that only contains the text between the <context> tags in the corresponding "xml" file.
This is created for the convenience of the Ngram Statistics Package, which is geared towards plain text and has no convenient mechanism for ignoring XML tags in data.

Finally, preprocess.pl has the very important role of identifying tokens in the text that occurs in the region defined by the <context> tags. Tokens are any sequences of characters that can be defined through regular expressions. These regular expressions should be provided by the user; although preprocess.pl provides default regular expressions, we suggest that the user not depend on them... see README.details.txt for more on this. This section continues with a description of the tokenization process, and then a brief description of the various ways this program may be used.

4.1 Tokenization of Text:
-------------------------

Program preprocess.pl accepts regular expressions from the user and then "tokenizes" the text between the <context> tags. This is done to simplify the construction of regular expressions in program nsp2regex.pl and to achieve optimum regular expression matching in xml2arff.pl. Following is a description of the tokenization process.

The text within the <context> tags is considered as one string, the "input" string. This algorithm takes this input string and creates an "output" string where tokens are identified and separated from each other by a SINGLE space. Regex's provided by the user are checked against the input string to see if a sequence of characters starting with the first character of the string matches against any of these regex's. As soon as we find a regular expression that does match, this checking is halted, the matched sequence of characters is removed from the input string and appended to the "output" string with exactly one space to its left and right. If none of the regex's match against the starting characters of the input string, the first character is considered a "non-token".
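The matching loop just described can be sketched as follows. This is an illustrative Python sketch, not the actual Perl implementation; in particular, the assumption that whitespace merely separates candidate tokens (rather than becoming non-tokens itself) is ours. The wrapping of non-tokens in angle brackets, described next, is included for completeness:

```python
import re

def tokenize(text, token_regexes):
    """Greedy left-to-right tokenization in the style of preprocess.pl:
    the first regex that matches at the start of the remaining input
    wins; characters matched by no regex become "non-tokens"."""
    output = []
    while text:
        text = text.lstrip()          # assumption: whitespace only separates candidates
        if not text:
            break
        for pattern in token_regexes:
            m = re.match(pattern, text)
            if m and m.group(0):      # first matching regex wins
                output.append(m.group(0))
                text = text[m.end():]
                break
        else:
            output.append("<" + text[0] + ">")   # non-token, wrapped in <>
            text = text[1:]
    return " ".join(output)           # tokens separated by a SINGLE space
```

For example, with the single token regex \w+, tokenize("No, he has no authority on this!", [r"\w+"]) yields "No <,> he has no authority on this <!>".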
By default this non-token is placed in angular brackets (<>) and then put into the output string with one space to its left and right. This process is continued until the input string becomes empty, and is restarted for the next instance.

For example, assume we provide the following regular expressions to preprocess.pl:

<head>\w+</head>
\w+

The first regular expression says that a sequence of characters starting with "<head>", having an unbroken sequence of alphanumeric characters and finally ending with "</head>" is a valid token. Also, an unbroken sequence of alphanumeric characters makes a token. Then, assuming that the following text occurs within the <context> tags of an instance:

No, he has no <head>authority</head> on this!

preprocess.pl would then convert this text to:

No <,> he has no <head>authority</head> on this <!>

Observe that "No", "he", "has", "no", "<head>authority</head>", etc. are all tokens, while "," and "!" aren't tokens and so have been put into angular brackets. Further observe that each token has exactly one space to its left and right.

One can provide a file containing regular expressions to preprocess.pl using the switch --token. In this file, each regular expression should be on a line of its own and should be preceded and followed by '/' signs. Further, these should be perl regular expressions (see [6] for more on what constitutes a perl regular expression). Thus our regular expressions above would look like so:

/<head>\w+<\/head>/
/\w+/

We shall call the file in which these regular expressions lie "token.txt". Then, we would run preprocess.pl on example.xml with this token file like so:

preprocess.pl example.xml --token token.txt

Preprocess.pl produces two kinds of output: "xml" output and "count" output.

4.2. XML output:
----------------

By default, for each lexical element "word" in the training or test file (in the lexical sample of SENSEVAL-2), preprocess.pl creates a file of the name "word".xml.
For example, for the file example.xml, preprocess.pl will create files art.n.xml and authority.n.xml if it is run as follows:

preprocess.pl example.xml --token token.txt

File art.n.xml:

<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
<head>Art</head> you can dance to from the creative group called Halo <.>
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There <'> s always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park <.>
</context>
</instance>
<instance id="art.40007">
<answer instance="art.40007" senseid="..."/>
<context>
Paintings <,> drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display <.>
</context>
</instance>
</lexelt>
</corpus>

File authority.n.xml:

<corpus lang='english'>
<lexelt item="authority.n">
<instance id="authority.40001">
<answer instance="authority.40001" senseid="..."/>
<context>
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
</context>
</instance>
</lexelt>
</corpus>

Observe of course that the text within the <context> region has been tokenized as described previously, according to the regular expressions in file token.txt. This default behavior can be stopped either by using the switch --xml FILE, by which only one FILE is created, or by using the switch --noxml, by which no xml file is created.

4.3. Count output:
------------------

Besides creating xml output, this program also creates output that can be used directly with the program count.pl (from the Ngram Statistics Package [1]). After tokenizing the region within the <context> tags of each instance, the program puts together ONLY these pieces of text to create "count.pl ready" output. This is because count.pl assumes that all tokens in the input file need to be "counted", and generally we are only interested in the "contextual" material provided in each instance, and not the tags that occur outside the <context> region of text. By default, for each lexical element "word", this program creates a file of the name word.count. For example, for the file example.xml, we would get the files art.n.count and authority.n.count.
File art.n.count:

<head>Art</head> you can dance to from the creative group called Halo <.>
There <'> s always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park <.>
Paintings <,> drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display <.>

File authority.n.count:

Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>

This default behavior can be stopped either by using the switch --count FILE, by which only one FILE is created, or by using the switch --nocount, by which no count file is created.

Note that the --xml/--noxml switches and the --count/--nocount switches are independent of each other. Thus, although providing the --xml FILE or --noxml switches produces a single xml FILE or no xml file at all, you will still get all the count files, unless you also give the --count FILE or --nocount switches. Similarly, providing the --count FILE or --nocount switches does not affect the production of the xml files.

4.4. Further Details in File README.details.txt:
------------------------------------------------

File README.details.txt continues with more subtle variations on running the program preprocess.pl as well as some extra switches that preprocess.pl provides for advanced uses. Specifically, it fleshes out in detail the effect of various regular expressions on tokenization, and some sundry issues with regular expression files. It also discusses the switches --senseid, --lexelt, --split, --seed, --removeNotToken, and --nontoken.

5. Program nsp2regex.pl:
------------------------

We convert our lexical sample files into a feature vector representation where each instance of the lexical sample file is represented by a vector of binary features.
Each feature is represented by a regular expression and, for a given instance, gets a value of 1 or 0 depending on whether or not the regular expression corresponding to the feature matches the text within the <context> tags of the instance. While program xml2arff.pl does the actual conversion of files from lexical sample format to feature vector representation, program nsp2regex.pl is concerned with creating regular expressions for features. A "feature" could be a single word, a set of words, an XML tag, or some combination of these that occurs within the <context> region of a given instance.

5.1 Features Based on Bigrams:
------------------------------

In our examples below, we assume that our instance is the following:

Paintings , drawings and sculpture from every period of <head> art </head> during the last 350 years will be on display .

As mentioned above, a feature could represent whether or not a set of words occurs in the instance. Say our feature is the bigram "period of". Since it is indeed present, this feature would get a value of 1 for this instance. The input to nsp2regex.pl would be

period<>of<>

Note that the diamond <> must terminate every word or token we wish to match. Whatever comes after the last diamond on the line is ignored. Note too that we must not have the <> itself as a token we wish to match. (We describe and explain the output of nsp2regex.pl in README.details.txt.) Further note that the order of the words in the bigram is important. Thus the feature "of period" would get a value of 0 for this instance since it is not present!

5.2. Features Based on Bigrams with Intervening Skipped Tokens:
---------------------------------------------------------------

Normally, a sequence of words must be consecutive for it to exist as a feature. As such, the feature "drawings sculpture" does not exist in the instance above, since those two words are separated by "and" in the instance.
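The consecutive-bigram case can be approximated with an ordinary regular expression over the tokenized text. The sketch below is illustrative Python, not nsp2regex.pl's actual output (which additionally handles windows and the skipping of <...> items, described next):

```python
import re

def bigram_regex(w1, w2):
    # Consecutive bigram over tokenized text: the two tokens must be
    # adjacent, separated by exactly one space.
    return r"(?:^| )" + re.escape(w1) + " " + re.escape(w2) + r"(?: |$)"

instance = ("Paintings , drawings and sculpture from every period "
            "of art during the last 350 years will be on display .")

present = bool(re.search(bigram_regex("period", "of"), instance))          # True
out_of_order = bool(re.search(bigram_regex("of", "period"), instance))     # False: order matters
split_pair = bool(re.search(bigram_regex("drawings", "sculpture"), instance))  # False: "and" intervenes
```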
However, we could specify in a regular expression that the words in a feature are allowed to have other words between them. Thus we may modify the "drawings sculpture" feature so that at most one word is allowed between its two words. In that case this feature exists! This can be done in nsp2regex.pl by adding a "directive" before the bigram in the input to the program:

@count.WindowSize=3
drawings<>sculpture<>

This specifies that the "window" within which we allow our tokens to match (in this case "drawings" and "sculpture") is 3 words long. Thus the tokens "drawings" and "sculpture" may be separated by at MOST one token. Program nsp2regex.pl takes the above input and produces a regular expression that would match the instance above. Similarly, if we have the following as the input to nsp2regex.pl:

@count.WindowSize=3
Paintings<>drawings<>

then the regular expression that nsp2regex.pl produces would match the instance above because the "," is a token too! However, for this, nsp2regex.pl must be provided with a token file in the same way as preprocess.pl was. Moreover, this should be the same file provided to preprocess.pl so that the token definitions of the two programs match. One may provide the token file, say token.txt, to nsp2regex.pl using the --token switch. This ensures that when nsp2regex.pl creates regex's that ignore intervening tokens, its definition of a token coincides with preprocess.pl's definition of a token.

5.3. Skipping of "Tags", and Other Kinds of Features:
-------------------------------------------------------

By default, we assume that any string enclosed within < and > is ignored during feature identification. This is because such strings are either non-tokens or they are XML tags. We call this "skipping". Thus the feature corresponding to the tokens "of art" does get a value of 1 in the instance above, even without requiring a window of 3 tokens, since the XML tag <head> gets skipped.
However, if we specify that a feature includes a string that begins with a < and ends with a > then that string will not be ignored. Note that since we require our input text to be XML, any "normal" use of < or > will be designated using the XML conventions &lt; and &gt;. On the other hand, sometimes we may explicitly want an XML tag in a feature. Thus, we may have a feature like "<head> art"... this feature gets a 1 above. Similarly, the feature "art during" also gets a 1... the "</head>" is "skipped" as before.

While the features used to create feature vectors can be decided "by hand", one could also use the Ngram Statistics Package [1]. See README.details.txt for more on how this may be done.

5.4. Important Note:
--------------------

Note that the regular expressions that nsp2regex.pl creates assume that the text on which they are going to be applied has already been tokenized, and that tokens have been separated by single spaces. Thus it is essential to run preprocess.pl on the text before running the regular expressions created by nsp2regex.pl on it!

5.5. Further Details in File README.details.txt:
------------------------------------------------

File README.details.txt continues with a detailed description of the regular expressions created in various situations, what they match and what they mean. It also gives pointers on how to use NSP to generate features.

6. Program xml2arff.pl:
-----------------------

This program converts Senseval-2 lexical files into a feature vector representation suitable for use with the Weka machine learning system. The input to xml2arff.pl is a set of features specified as regular expressions (created by nsp2regex.pl) and XML formatted Senseval-2 lexical file(s). xml2arff.pl converts the lexical file(s) to a vector representation using the regular expressions. The vector representation follows the arff format used by Weka, free and open source software written in Java that implements many machine learning algorithms.
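At its core, the conversion xml2arff.pl performs amounts to evaluating each feature regex against each instance's tokenized context. A simplified Python sketch (the real program reads its regexes from nsp2regex.pl's output file; the helper name here is ours):

```python
import re

def feature_vector(context, feature_regexes):
    # One binary attribute per feature regex: 1 if the regex matches
    # anywhere in the tokenized context of the instance, else 0.
    return [1 if re.search(rx, context) else 0 for rx in feature_regexes]

context = "every period of art during the last 350 years"
vector = feature_vector(context, [r"period of", r"of period"])   # [1, 0]
```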
See [3] for more information on Weka as well as on the .arff format that Weka uses; we present here a brief description of the .arff format.

6.1. Description of the arff format:
------------------------------------

Recall that an "instance" in the SENSEVAL-2 lexical source files is defined as all the data within the <instance> tag pair, and consists of one or more sentences containing one occurrence of the target word to be disambiguated, demarcated by the <context> tag pair. Further recall that if the input file is a training file, then each such instance has an <answer> tag specifying the correct senseid(s) of the target word for this instance. The arff format also consists of "instances"; we shall distinguish between the two kinds of instances by calling the ones in the input lexical file "lexical instances" and those in the arff format "arff instances".

The output arff file has three components: the relation name, the attributes and the data itself.

6.1.1. The relation name:
-------------------------

The relation name specifies a name for the dataset. It can be chosen from the command line using the option --relation. If not chosen, it defaults to the string "RELATION". Besides giving a name to the dataset, it is not used for anything else.

6.1.2. The attributes:
----------------------

The "attributes" correspond to the "features" discussed previously. If there are N regular expressions in the input regex file, there are N+1 attributes in the output arff-format file. Of these attributes, attributes 1 to N are derived from regex's 1 to N respectively, as contained in the input regex file. Each of these N attributes is binary: 0/1. Attribute N+1, the "class" attribute, corresponds to the senseid of the target word. This attribute has as many "values" as there are unique senseid's used in the <answer> tags of the xml source file. If two xml source files are used, this "class" attribute has as many values as there are unique senseid's in the <answer> tags across the two source files.

6.1.3. The data:
----------------

The data section consists of at least as many arff instances as there are lexical instances in the input SENSEVAL source file. Specifically, for a training lexical file, there is one arff instance for each answer tag of each lexical instance, while for a test lexical file, the numbers of lexical instances and arff instances are the same.

There are two formats for an arff instance, the "normal" format and the "sparse" format. We use the "sparse" format; in this format each arff instance consists of a set of comma separated ordered pairs of numbers. The first number of each such pair represents an attribute number (attributes are numbered from 0 upwards, in the same order as they occur in the @attribute section). The second number of the pair denotes the value that this attribute gets. Only those attributes that get a non-zero value are mentioned; those having a value of 0 are not shown in the output. For example, following is a typical arff instance:

{3 1, 5 art~1:06:00::} % art.n art.40001

This says that attribute 3 has a value of 1, attribute 5 (the class attribute) has a value of art~1:06:00:: and the other attributes have a value of 0 each. The "% art.n art.40001" is not a part of the arff instance as such... it's a comment that is used by WekaClassify for creating output that follows the SENSEVAL-2 answer key format.

Note that for the (N+1)th attribute, which corresponds to the senseid of the target word, the senseid in the answer tag of the lexical instance is reported. In case the input file does not contain answer tags (e.g. when it is a test file), this field is given a '?', which implies that it is a "missing" value. In case the input file contains more than one answer for a single lexical instance, then for each answer a separate arff instance is created. This is because the arff format has no mechanism to report multiple answers for the same arff instance.
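The construction of sparse arff instances, including the duplication performed for multiple answers, can be sketched as follows (a hypothetical Python helper; attribute numbering and comment layout follow the examples in this README, but the function itself is not part of SenseTools):

```python
def sparse_arff_lines(vector, senseids, lexelt, instance_id):
    # Only non-zero attributes are listed; the class attribute (number
    # N for an N-feature vector) comes last. One arff instance is
    # emitted per answer, since arff cannot hold multiple answers; an
    # empty answer list yields '?', the missing-value marker.
    pairs = ["%d 1" % i for i, v in enumerate(vector) if v]
    class_attr = len(vector)
    lines = []
    for sense in (senseids or ["?"]):
        fields = pairs + ["%d %s" % (class_attr, sense)]
        lines.append("{" + ", ".join(fields) + "} % " + lexelt + " " + instance_id)
    return lines

# Two answers for one lexical instance give two arff instances:
for line in sparse_arff_lines([0, 0, 0, 1, 0],
                              ["art~1:06:00::", "fine_art~1:06:00::"],
                              "art.n", "art.40001"):
    print(line)
```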
In all these multiple arff instances that correspond to different answers of the same lexical instance, every field except the last one is the same.

6.2. Running the Program:
-------------------------

There are two ways to run xml2arff.pl:

1) By passing a single regular expression file and a single lexical sample file in SENSEVAL-2 format.

2) By passing a single regular expression file and a pair of lexical sample files in SENSEVAL-2 format (using the --training and --test options).

In both cases, the input lexical sample files are converted into the arff format discussed above. In the second case, the senseid's used as the possible values of the (N+1)th attribute in the arff-format output of both the training and test files are collected from the answer tags of both input files. This is necessary because a model created from a training file can only be used to generate answers for a test file whose attributes are exactly the same as those of the training file. Although the test file does not have any answer tags and therefore no senseid's, its (N+1)th attribute must still look exactly like the (N+1)th attribute of the training file, which will have several senseid's. Thus when one has a training file and a test file, it is necessary to use the second method above to ensure that the two output arff files do indeed have the same attributes, each with the same range of possible values.

6.3. An Illustrative Example:
-----------------------------

Let us say that the following is our input training xml file. Let us call this file art.n.xml.
<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
art you can dance to from the creative group called halo
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There always one <1> to be heard somewhere during the summer in the piazza in front of the art gallery and town hall or in a park
</context>
</instance>
</lexelt>
</corpus>

Let us say we want to make regex's out of the following features, as discussed previously:

@count.WindowSize=3
one<>to<>
during<>summer<>
the<>art<>
art<>can<>
heard<>somewhere<>
<1><>to<>

Recall that the line "@count.WindowSize=3" implies that between the tokens we allow at most one word to occur. Additionally, of course, we would like the regex's to skip any XML tags and non-tokens that come in between and that are not explicitly mentioned in the feature.

Passing the above to nsp2regex.pl creates a regular expression file, each regex corresponding to one of the features above. Say this file is called regex.txt. At this point if we were to run the following command:

xml2arff.pl regex.txt art.n.xml

we would get art.n.xml.arff, which would look like so (the line numbers have been added here for explanation purposes):

Line 1: @relation 'RELATION'
Line 2: @attribute 'one<>to<>1' {0,1}
Line 3: @attribute 'during<>summer<>1' {0,1}
Line 4: @attribute 'the<>art<>1' {0,1}
Line 5: @attribute 'art<>can<>1' {0,1}
Line 6: @attribute 'heard<>somewhere<>1' {0,1}
Line 7: @attribute '<1><>to<>1' {0,1}
Line 8: @attribute 'class' {fine_art~1:06:00::, art_gallery~1:06:00::, art~1:06:00::}
Line 9: @data
Line 10: {3 1, 6 art~1:06:00::} % art.n art.40001
Line 11: {3 1, 6 fine_art~1:06:00::} % art.n art.40001
Line 12: {0 1, 1 1, 2 1, 4 1, 5 1, 6 art_gallery~1:06:00::} % art.n art.40002

Line 1 takes the default relation name RELATION. Lines 2 through 7 declare the attributes corresponding to the regex's from the regex file. Notice the names are taken from the "@name = foo" portion of the output from nsp2regex.pl. Line 8 lists the senseid's found in the input xml file; here we have three. Line 9 declares that the data starts here.
Lines 10, 11 and 12 are the feature vector representations of the two lexical instances. Note that this is the sparse format mentioned previously. Notice that attributes 0, 1, 2, 4 and 5 get a value of 0 for instance art.40001 but get 1 for instance art.40002. Similarly, attribute 3 gets 1 for art.40001 but 0 for art.40002. The last pair denotes the "class attribute" and has a value picked out of the senseid of the instance.

Since the first lexical instance, art.40001, has two answer tags, there are two arff instances corresponding to that one lexical instance. These two instances differ only in their senseid attribute, that is, the last attribute. The second lexical instance, art.40002, has only one answer tag, and hence has only one corresponding arff instance.

In lines 10, 11 and 12, the data after the % sign is treated as a comment in the arff format. This data specifies the lexelt and the instance id... these are used by program WekaClassify to generate answer files in SENSEVAL format (more on that below).

Note: On looking at the attributes "matched", one sees that the matching has indeed proceeded as expected. "one to" has matched the string "one <1> to" because the non-token sequence <1> is "skipped". "during summer" has matched the string "during the summer" because we are skipping at most one word. "art can" has matched since "you" is skipped. Also, "<1> to" has matched: recall that non-token sequences are skipped ONLY when they do not occur in the feature itself!

7. Program WekaClassify:
------------------------

This java program takes a Senseval test file in arff format (as output by xml2arff.pl) and classifies the instances in the file using the stored representation of a "model" learned by Weka from a set of training data.

7.1. The Model:
---------------

WekaClassify uses a previously learned (and stored) model to classify instances in the given test arff file.
This model has to be created by training some classifier (such as a
decision tree, a neural network, a naive Bayesian classifier, etc.) on
training data consisting of instances of the same target word as that in
the test arff file. Moreover, the model has to be created using Weka. One
can create such a model in Weka by using the -d switch (see Weka's [3]
help and documentation for more information). A model thus saved contains
information about the type of classifier used as well as other facts
"learned" from the training data.

WekaClassify uses this model to classify the instances in the test file.
The model file can be passed to WekaClassify using the -d option.
Classification is done by calling the Weka library routines. Hence Weka
should be available on the Java CLASSPATH for this program to run
properly.

7.2. Other Command-line Options:
--------------------------------

Other command-line options include the -t switch to specify the test file
and the -p switch to specify the level of precision required. By default,
values are output to 4 decimal places.

7.3. The Output of WekaClassify:
--------------------------------

For each instance in the test file, WekaClassify outputs a probability
distribution over all the possible 'class' values. As expected, these
values range from 0 (implying this particular instance has zero
probability of belonging to this particular class) to 1 (implying this
particular instance certainly belongs to this particular class). For
SENSEVAL test files, "class" values are the possible senseid values that
the target word in the test data may assume. This output is in the answer
file format required by SENSEVAL (and used by the scorer.python program).

Following is the format of the output of WekaClassify: ...
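Each answer line pairs a lexelt and an instance id with a series of
sense/score fields. As an illustration only (a hypothetical Python helper,
not part of SenseTools, which is written in Perl and Java), such a line
could be parsed like so:

```python
# Hypothetical sketch: parse one answer line of the form
# "lexelt instance-id sense/score sense/score ..." into its parts.
def parse_answer_line(line):
    fields = line.split()
    lexelt, instance_id = fields[0], fields[1]
    scores = {}
    for field in fields[2:]:
        # Split on the last '/' so the sense string survives intact.
        sense, score = field.rsplit("/", 1)
        scores[sense] = float(score)
    return lexelt, instance_id, scores

lexelt, iid, dist = parse_answer_line(
    "art.n art.40001 art_gallery~1:06:00::/0.0 "
    "fine_art~1:06:00::/0.0 art~1:06:00::/1.0")
```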
For example, suppose for our art.n.xml.arff file above we find the
following:

art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50

For art.40001, this tells us that the first two senses have zero
probability of occurring, while the third sense has 100 percent
probability of occurring.

8. Program xml2key.pl:
----------------------

This program takes one or more training files in the SENSEVAL format and
creates a "key" file that contains the correct sense assignments as taken
from the training files. This key file can be directly used with
scorer.python as used in SENSEVAL-2. The tags in the training files are
used to generate this output. For example, for the art.n.xml file above,
we would get the following output key file:

art.n art.40001 art~1:06:00:: fine_art~1:06:00::
art.n art.40002 art_gallery~1:06:00::

9. Program tilde.pl:
--------------------

We represent WordNet senseid's for the English lexical sample data of
Senseval-2 like "art~1:06:00::". However, the actual format in WordNet is
"art%1:06:00::". It is an unfortunate coincidence that the arff format of
Weka treats the % sign as a comment, so if we don't remove this from the
senseid's we will find our arff instances prematurely terminated by this
comment. Thus, we replace the "%" sign with the "~" sign early in the
process, usually immediately before or after running preprocess.pl. We
then replace the "~" sign with the "%" sign after WekaClassify.java has
run, so that our results have the same form as was standard in Senseval-2
and can be scored against the "official" Senseval-2 key files.

This program takes an input file and looks for strings of characters that
look like a senseid (either ~ or % followed by the sequence numbers, ':',
numbers, ':', numbers, ':'). If the first character is a ~ it is replaced
with a %, and vice versa. Thus this program goes both directions...
both to replace % with ~ and ~ with %. For example, if the key file above
is run through tilde.pl, we would get:

art.n art.40001 art%1:06:00:: fine_art%1:06:00::
art.n art.40002 art_gallery%1:06:00::

Similarly, if the training file is run through tilde.pl, the senseid's
should get suitably converted. Thus running example.xml through tilde.pl
gives:

art you can dance to from the creative group called halo There always one
<1> to be heard somewhere during the summer in the piazza in front of the
art gallery and town hall or in a park

10. Program ensembleByDist.pl:
------------------------------

This program creates an "ensemble" of results by adding together the
output from a number of different models produced by Weka. Each of these
models can be based on different learning algorithms and/or feature sets,
and the hope is that their combined performance will be better than that
of any of the individual models. Each model outputs a distribution of
probabilities or confidence measures for each possible sense of each
instance, and these are all summed together to arrive at a final score
for each sense of each instance. Normally the sense with the highest
score would be assigned to the instance, and this is carried out by
winnerTakeAll.pl.

This program expects a certain directory structure. We shall explain the
directory structure through an example. Assume that we have represented
the training data using two different feature sets, Bigrams and Trigrams.
Further assume that for each feature set, we have trained two
classifiers, say a NaiveBayesian classifier and a DecisionTree. Finally,
say we have two words to disambiguate, word-1 and word-2, representing
two different tasks.
Our directory structure should be as follows:

Bigrams
    word-1
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
    word-2
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
Trigrams
    word-1
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
    word-2
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist

where Bigrams and Trigrams are our top-most directories. Each contains
one directory for each of the two tasks, word-1 and word-2. Each word
directory contains one file for each of the two classifiers, containing
the sense assignments as reported by WekaClassify.java. These filenames
must end with ".answer.dist". Thus in our example, both word-1 and word-2
have four answer files each: NaiveBayesian using Bigrams, NaiveBayesian
using Trigrams, DecisionTree using Bigrams and DecisionTree using
Trigrams.

Each of these ".answer.dist" files should be in the format expected by
scorer.python. (Program WekaClassify produces output in this format.)
Thus for each instance in task word-1, there are four probability
distributions over the possible classes in the four files. Program
ensembleByDist.pl will add these probability distributions and output the
total for each.

For each word, ensembleByDist.pl outputs a file consisting of a single
distribution based on the sum of all the input distributions, named
task.ensemble.answer.dist. Thus in the above example, it would create two
such output files: word-1.ensemble.answer.dist and
word-2.ensemble.answer.dist.
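The summing that ensembleByDist.pl performs can be sketched as follows
(a hypothetical Python helper for illustration only; the suite itself is
written in Perl). Each input is a sense-to-score mapping for one instance
as read from one ".answer.dist" file, and the scores are simply added
sense by sense:

```python
# Hypothetical sketch of the ensemble step: combine several answer-file
# distributions (sense -> score) for one instance by summing the scores.
def sum_distributions(dists):
    total = {}
    for dist in dists:
        for sense, score in dist.items():
            total[sense] = total.get(sense, 0.0) + score
    return total

# The four distributions for instance word-1.40001 in the example below
# sum to A/1.25 B/1.0 C/1.75:
combined = sum_distributions([
    {"A": 0.0,  "B": 0.0,  "C": 1.0},   # Bigrams/NaiveBayesian
    {"A": 1.0,  "B": 0.0,  "C": 0.0},   # Bigrams/DecisionTree
    {"A": 0.25, "B": 0.50, "C": 0.25},  # Trigrams/NaiveBayesian
    {"A": 0.0,  "B": 0.5,  "C": 0.5},   # Trigrams/DecisionTree
])
```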
For example, let the following be the contents of the files:

Bigrams/word-1/NaiveBayesian.answer.dist:
word-1 word-1.40001 A/0.0 B/0.0 C/1.0
word-1 word-1.40002 A/0.25 B/0.25 C/0.50

Bigrams/word-1/DecisionTree.answer.dist:
word-1 word-1.40001 A/1.0 B/0.0 C/0.0
word-1 word-1.40002 A/0.75 B/0.25 C/0.0

Trigrams/word-1/NaiveBayesian.answer.dist:
word-1 word-1.40001 A/0.25 B/0.50 C/0.25
word-1 word-1.40002 A/0.0 B/1.0 C/0.0

Trigrams/word-1/DecisionTree.answer.dist:
word-1 word-1.40001 A/0.0 B/0.5 C/0.5
word-1 word-1.40002 A/0.85 B/0.15 C/0.0

Bigrams/word-2/NaiveBayesian.answer.dist:
word-2 word-2.40001 A/1.0 B/0.0
word-2 word-2.40002 A/0.25 B/0.25

Bigrams/word-2/DecisionTree.answer.dist:
word-2 word-2.40001 A/0.1 B/0.9
word-2 word-2.40002 A/0.25 B/0.75

Trigrams/word-2/NaiveBayesian.answer.dist:
word-2 word-2.40001 A/0.4 B/0.6
word-2 word-2.40002 A/0.5 B/0.5

Trigrams/word-2/DecisionTree.answer.dist:
word-2 word-2.40001 A/0.21 B/0.79
word-2 word-2.40002 A/0.52 B/0.48

Then upon running ensembleByDist.pl like so:

ensembleByDist.pl Bigrams Trigrams

we get the following files:

word-1.ensemble.answer.dist:
word-1 word-1.40001 A/1.25 B/1 C/1.75
word-1 word-1.40002 A/1.85 B/1.65 C/0.5

word-2.ensemble.answer.dist:
word-2 word-2.40001 A/1.71 B/2.29
word-2 word-2.40002 A/1.52 B/1.98

Note that the final output will likely not be a probability distribution
over the possible classes in the strict sense, since no normalization is
done. However, if these values are probabilities then scorer.python will
do its own normalization. Also note that some of the Weka learning
methods do not output probabilities, but rather some type of confidence
measure. It is possible to add apples and oranges (probabilities with
confidence measures), so a user should be certain that the ensemble as
constructed is reasonable, and that the summing operation that underlies
it is appropriate.

11.
Program winnerTakeAll.pl:
-------------------------

The output of WekaClassify provides a distribution of probabilities or
confidence measures over all the possible values of the sense/class
feature for an instance. This program takes the output from
ensembleByDist.pl as input and outputs only the sense that has the
highest summed score for each instance. If there is a tie for the highest
value then all senses that share this value are output.

For example, given the following WekaClassify output:

art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50

winnerTakeAll.pl will output the following:

art.n art.40001 art~1:06:00::
art.n art.40002 art~1:06:00::

This default behavior can be changed by changing the "margin" used to
determine the winner. If such a margin 'm' is provided, an answer is
output if the difference between its probability and the probability of
the best answer is less than or equal to 'm'. The default margin is 0.
For example, if a margin of 0.6 were used in the example above, we would
get the following from winnerTakeAll.pl:

art.n art.40001 art~1:06:00::
art.n art.40002 art~1:06:00:: fine_art~1:06:00:: art_gallery~1:06:00::

The output is in the format required by scorer.python. A probability
value is not attached to the output answer(s), and therefore equal
probability is assumed for all answers that make it to the output.

12. Replicating Results from Senseval-2 for Duluth Systems
----------------------------------------------------------

The Duluth systems that participated in the Spanish and English lexical
sample tasks in Senseval-2 can be replicated through the use of
SenseTools (v0.1), the Bigram Statistics Package (v0.4), and Weka (v3.2).
There are a set of C shell scripts for Unix/Linux that tie these tools
together and replicate the Senseval-2 systems.
These shell scripts are distributed with SenseTools and can be found in
the Scripts subdirectory. SenseTools-0.3 is not compatible with the
original Duluth systems, but must rather be used with the more recent
versions of the DuluthShell (v0.3), NSP (v0.55), and Weka (v3.2.2). Some
of the Duluth systems that participated in Senseval-2 are described in
[2].

13. Copying:
------------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

14. Acknowledgments:
--------------------

This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
of Research, Artistry and Scholarship from the Office of the Vice
President for Research and the Dean of the Graduate School of the
University of Minnesota.

15. References:
---------------

1. S. Banerjee and T. Pedersen. The Design, Implementation, and Use of
   the Ngram Statistics Package. Appears in the Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics, February 17-21, 2003, Mexico City. The code
   can be obtained from http://www.d.umn.edu/~tpederse/nsp.html

2. T. Pedersen. A baseline methodology for word sense disambiguation.
   Appears in the Proceedings of the Third International Conference on
   Intelligent Text Processing and Computational Linguistics (CICLING-02),
   February 17-22, 2002, Mexico City.

3. Weka 3 - Machine Learning Software in Java. World Wide Web site:
   http://www.cs.waikato.ac.nz/ml/weka/

4. SENSEVAL-2: Second International Workshop on Evaluating Word Sense
   Disambiguation Systems. World Wide Web site:
   http://www.sle.sharp.co.uk/senseval2/

5. C. Fellbaum (ed). 1998. WordNet: An Electronic Lexical Database. The
   MIT Press.

6. L. Wall, T. Christiansen and J. Orwant. 2000. Programming Perl.
   O'Reilly.