README
------
SyntaLex Version 0.1
Copyright (C) 2001-2002
Mohammad Saif, smm@cs.toronto.edu
University of Toronto
Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth
##################### LAST UPDATED: Feb, 2004 #######################
SyntaLex is a Perl package for word sense disambiguation of data in
the Senseval-2 data format, using various lexical and syntactic
features. We also provide a Perl program which calculates various
measures that may be used to determine the redundancy and
complementarity of two sets of classifications.
######################################################################
1. INTRODUCTION:
================
Word sense disambiguation is the process of identifying the correct
sense, or meaning, of a word based on its context. Context may be
defined as the sentence housing the target word and possibly one or
more sentences surrounding it. The phenomenon of a word having
multiple meanings, or senses, is known as polysemy and such words are
said to be polysemous. For example, consider the noun `canteen'. It
has two senses: `a flask to carry liquids' and `a cafeteria'. It is
thus polysemous. Usually the intended sense of a polysemous word may
be identified based on its context. For example:
   The prisoner was abandoned in the desert with just one canteen of
   water.
In the above sentence, `canteen' has clearly been used in the `flask
to carry liquids' sense. While this is obvious to humans, creating an
automated system which identifies the intended sense of a word based
on its context remains a challenge.
Numerous lexical and syntactic clues, or features, of the context may
be used for automated word sense disambiguation. Lexical features,
such as the surface form of a word, unigrams and bigrams, may be
readily captured from raw text. Features which rely on a more
enriched corpus, such as part of speech tagged and/or parsed text,
are known as syntactic features. SyntaLex is a Perl implementation of
word sense disambiguation using individual lexical and syntactic
features or combinations of them. It uses the Waikato Environment for
Knowledge Analysis' (WEKA's) C4.5 decision tree learning algorithm to
learn decision trees. Only slight variations in the scripts will
allow the use of other learning algorithms. The package uses the
N-gram Statistics Package (NSP) to capture unigram and bigram
features. The package is an extension of the Duluth Shell system and
utilizes certain Perl programs from it and from the SenseTools
package.
2. LOCATION OF FILES:
=====================
Unpacking this package creates the SyntaLex directory within which all
files and directories exist. All shell scripts are in the `Scripts'
directory while all perl programs are in the `Perl' directory. Thus,
if $HOME represents the directory you have unpacked the package in,
all shell scripts will be in $HOME/SyntaLex/Scripts and all the Perl
programs in $HOME/SyntaLex/Perl. $HOME/SyntaLex/DataSensitive has
certain files which may be used by the master scripts. These files are
sensitive to the data being used. Depending on which data is being
used, appropriate files are to be copied from
$HOME/SyntaLex/DataSensitive to $HOME/SyntaLex. Modified versions of
these files or completely different files may be used. The files in
this directory are provided for the user's convenience. `duluthCombo'
utilizes the following text files: fine.key, stop.list, token.txt,
nontoken.txt and sensemap. All these files must be present in
$HOME/SyntaLex. `fine.key' and `sensemap' depend on the data being
used. Files corresponding to Senseval-2, Senseval-1, "line", "hard",
"serve" and "interest" are provided in the
$HOME/SyntaLex/DataSensitive directory. They are named as follows:
Sense-Tagged Data    fine.key              sensemap
-----------------    --------              --------
Senseval-2 data      Senseval2-fine.key    Senseval2-sensemap
Senseval-1 data      Senseval1-fine.key    dummy-sensemap
"line" data          line-fine.key         dummy-sensemap
"hard" data          hard-fine.key         dummy-sensemap
"serve" data         serve-fine.key        dummy-sensemap
"interest" data      interest-fine.key     dummy-sensemap
As a default, $HOME/SyntaLex has files corresponding to Senseval-2
data. Move in appropriate files from $HOME/SyntaLex/DataSensitive for
experiments with other data.
$HOME/SyntaLex/LexSample must have the test and training data. The
files corresponding to each lexical element must be placed in a
directory by the name of the lexical element. These directory names
must be suffixed by `.s' (as in Senseval-2 data) or `-s' (as in
Senseval-1 data), where, `s' stands for the syntactic category
corresponding to the lexical element. `s' may take any of the
following forms: `n' for nouns, `v' for verbs, `a' for adjectives and
`p' for indeterminates (lexical elements whose data instances belong
to more than one or unknown part of speech). The directory for each
lexical element must have the test and training xml and count files.
To exemplify the above requirements, the $HOME/SyntaLex/LexSample
directory may have directories named: `art.n', `authority.n', `begin.v' and
`blind.a'. `art.n' and `authority.n' correspond to noun lexical
elements, `begin.v' corresponds to verb instances of `begin' and
`blind.a' corresponds to adjective instances of `blind'. `art.n'
itself must have the following files: art.n-test.count,
art.n-test.xml, art.n-training.count and art.n-training.xml. Similar
files must exist in all other lexelt directories.
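For instance, following the example above, the LexSample directory
would be laid out as follows:
   LexSample/art.n/art.n-training.xml
   LexSample/art.n/art.n-training.count
   LexSample/art.n/art.n-test.xml
   LexSample/art.n/art.n-test.count
   LexSample/begin.v/begin.v-training.xml
   LexSample/begin.v/begin.v-training.count
   LexSample/begin.v/begin.v-test.xml
   LexSample/begin.v/begin.v-test.count
and similarly for the other lexelt directories.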
3. NECESSARY SOFTWARE:
======================
The following packages must be downloaded and unpacked before running
the master scripts.
a. SyntaLex: Download and unpack the SyntaLex package. It is available at:
http://www.cs.toronto.edu/~smm/research.html
Let the directory where SyntaLex is unpacked be $HOME. The package
unpacks into the directory SyntaLex within which all other files are
created. Thus, all the SyntaLex files are within $HOME/SyntaLex. The
following paths and environment variable are to be set:
setenv SyntaLexHOME $HOME/SyntaLex
set path = ($SyntaLexHOME $path)
set path = ($SyntaLexHOME/Scripts $path)
set path = ($SyntaLexHOME/Perl $path)
b. SenseTools: The SenseTools package is to be downloaded from:
http://www.d.umn.edu/~tpederse/sensetools.html
We used version 0.3; any later version should be compatible as well.
Let the directory where SenseTools is unpacked be $HOME. A directory
named as per the version of SenseTools is created here and all other
files are created within it. Thus, I have all SenseTools files in
$HOME/SenseTools-0.3. The following path and environment variable are
to be set:
setenv SENSEHOME $HOME/SenseTools-0.3
set path = ($SENSEHOME $path)
c. WEKA: A stable version of WEKA is to be installed. We have used
weka-3-2-3 (GUI version) and any version above it should be
compatible, as well. It is available at:
http://www.cs.waikato.ac.nz/ml/weka/
The files may be unpacked by the following commands:
jar -xvf weka-3-2-3.jar
jar -xvf weka.jar
jar -xvf weka-src.jar
Note: weka.jar and weka-src.jar are created within weka-3-2-3
directory, which is created by the first jar command. Let the
directory where WEKA is unpacked be $HOME. Then a directory named as
per the version of WEKA used is created here, and all the WEKA
files and subdirectories are created within it. Thus, I have all the
files in $HOME/weka-3-2-3. The following paths and environment
variables must be set.
setenv WEKAHOME $HOME/weka-3-2-3
set path = ($WEKAHOME $path)
setenv CLASSPATH $WEKAHOME/weka.jar:$SENSEHOME
d. DuluthShell: Install DuluthShell from:
http://www.d.umn.edu/~tpederse/senseval2.html
Like the two packages above, unpacking DuluthShell creates a directory
named as per its version. I have used version 0.3. Thus, all
DuluthShell files will be within $HOME/DuluthShell-v0.3. Set paths as
follows:
set path = ($HOME/DuluthShell-v0.3/Scripts $path)
set path = ($HOME/DuluthShell-v0.3/Perl $path)
e. NSP: The N-gram Statistics Package is available at:
http://www.d.umn.edu/~tpederse/nsp.html
We have used version 0.59; all newer versions should be compatible
as well. Let the directory where NSP is unpacked be $HOME. Then a
directory named as per the version of NSP used is created here, and
all the NSP files and subdirectories are created within it. Thus, I
have all the NSP files in $HOME/nsp-v0.59. The following paths and
environment variables must be set.
setenv NSPHOME $HOME/nsp-v0.59
set path = ($NSPHOME $path)
set path = ($NSPHOME/Measures $path)
set path = ($NSPHOME/Utils $path)
f. Perl, version 5.6.1 or above, and Python, version 2.2 or above,
must be installed and their paths set. Note, you may have to change
the first line of scorer.python (a program in DuluthShell) to reflect
the correct path of Python. For example, if the Python interpreter is
at /usr/bin/python2.2, then the first line of scorer.python must be:
#!/usr/bin/python2.2
All the Perl programs assume the Perl interpreter to be at
`/usr/bin/perl'. If this is not the case, requisite changes will need
to be made to the first line of the Perl programs. An alternative to
changing the code for this purpose is to alias `/usr/bin/perl' to the
appropriate Perl location. For example, if Perl is located at
`/usr/local/bin/perl', the following command aliases `/usr/bin/perl'
to `/usr/local/bin/perl'.
alias /usr/bin/perl /usr/local/bin/perl
4. SCRIPTS:
===========
Two scripts, wsd-setup and duluthCombo, may be used to take you
through the complete process of pre-processing, capturing features,
learning decision trees, running them on the test data and evaluating
the sense tagging. Following are their usage details:
a. wsd-setup: This script pre-processes the sense-tagged data in
Senseval-2 data format to make it suitable for word sense
disambiguation using WEKA. Its functions are as follows:
1. Create a new directory, say SETUP, which has the sense-tagged
data and files needed by duluthCombo such as fine.key, sensemap,
token and nontoken files.
2. Replace the percent symbol in the answer ID tag of a Senseval-2
data format training file with `~'.
3. Eliminate the xml tags which suggest that an instance is a
proper noun (the sense ID `P' stands for proper noun and is not
another sense of the target word). This avoids considering `P' as a
sense of the instance.
Usage: wsd-setup SETUP
SETUP : The new directory created by wsd-setup which holds all files
needed by duluthCombo and the sense-tagged data. All these files are
copied from $SyntaLexHOME. Only the sense-tagged data is pre-processed
as above, the rest remain unchanged. The files copied are as follows:
a. fine.key : This is the key file which has the manually tagged
senses for the test data.
b. sensemap : This file is used for finer grained distinctions
among the senses.
Further details of the key file and sensemap file may be found at:
http://www.sle.sharp.co.uk/senseval2/Scoring/
c. token.txt : This file is used to identify the valid tokens in
the data, by the NSP package. Used in detecting bigram and
unigram features.
d. nontoken.txt : This file is used to identify invalid tokens
and remove them from consideration, by the NSP package. Used in
detecting bigram and unigram features.
e. stop.list : This file is used by NSP to take care of non-content
words while identifying bigrams and unigrams.
f. LexSample : This directory must have the sense-tagged data to be
used for word sense disambiguation. It must have the count and xml
files of the test and training data of each lexical element within a
directory named as per the lexical element and suffixed with the
appropriate part of speech corresponding to the instances, as
described in section 2.
NOTE: $SyntaLexHOME by default has files for Senseval-2 data and
the LexSample directory is kept empty. To run experiments on
Senseval-2 data, all that needs to be done is to copy the data in
LexSample and run the master scripts. To run experiments on any
other data, copy the files mentioned above from the DataSensitive
directory to $SyntaLexHOME. Here is what I would do to
disambiguate `line' data:
   cp DataSensitive/line-fine.key ./fine.key
   cp DataSensitive/dummy-sensemap ./sensemap
   wsd-setup SETUP
b. duluthCombo: Once we have the pre-processed sense-tagged data and
relevant files as created by wsd-setup, duluthCombo may be used to
perform word sense disambiguation using WEKA and individual or
combination of lexical and syntactic features.
Usage: duluthCombo FEATURES SETUP STOPLIST TOKEN NONTOKEN MINCOUNT
OUTPUT PARSEFEATURES [POSFEATURES]
FEATURES : set of features which are to be used for WSD. Decision
trees are created corresponding to each feature set; feature sets
are separated from each other by a hyphen (-). Following are some
of the feature sets that may be specified:
   f00  Bigrams
   f4   Unigrams
   f5   Surface Form
   f6   Part of Speech
   f7   Parse
SETUP : directory created by wsd-setup which has all the relevant
files and data.
STOPLIST : text file of stop words. This file must be in
SETUP. stop.list is the file created by wsd-setup.
TOKEN : text file of token definitions. This file must be in
SETUP. token.txt is the file created by wsd-setup.
NONTOKEN : text file of non-token definitions. This file must be
in SETUP. nontoken.txt is the file created by wsd-setup.
MINCOUNT : the minimum number of times the features must exist in
the training data for them to be selected.
OUTPUT : Output directory to which all output files will be moved.
This directory is created by the script and must not already
exist.
PARSEFEATURES : specifies the parse features which are to be used
to create the parse features decision tree. It is specified as
hyphen separated digits (for example, 4-3-2-1), where each digit
selects one of the parse features listed below; a value of 0
implies that no parse features will be used as part of the
decision tree. The digits correspond to the following parse
features:
1 head word
2 phrase
3 head of parent phrase
4 parent phrase
0 No Parse Features
POSFEATURES : specifies the positions of the part of speech
features which are to be used. Individual positions are specified
as L1, L2, L3 and so on, or R1, R2, R3 and so on. L1 corresponds
to the part of speech of the word one position to the left of the
target word, L2 to the part of speech of the word two positions to
the left of the target word, and so on. R1 stands for the part of
speech of the target word itself (the first pos tag to the right
of the head word). R2 corresponds to the pos of the word one
position to the right of the target word (the second pos tag to
the right of the head word), R3 is the pos of the word two
positions to the right of the target word (the third pos tag to
the right of the head word), and so on. A position may be a
combination of two individual positions, say R1 and R2; the
program then learns features such as `R1 is a determiner AND R2 is
a noun'. Such combination positions are specified as A:R1-R2,
where A: stands for a combination feature (And) and the positions
combined are hyphen separated.
NOTE: Since all output files are moved to OUTPUT, you may run
wsd-setup just once and run duluthCombo multiple times with varying
parameters.
NOTE: stop.list, token.txt, and nontoken.txt are the default files
created by wsd-setup, which may or may not be used by duluthCombo. If
you wish to use other files, place them in SETUP.
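As an illustration, the following is a hypothetical invocation of
duluthCombo which learns bigram and parse feature decision trees,
using the head word and phrase parse features (the feature sets,
MINCOUNT and directory names here are illustrative only; see
`sample-runs' for the commands actually used):
   duluthCombo f00-f7 SETUP stop.list token.txt nontoken.txt 5 OUTPUT 1-2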
`sample-runs' shows how I have used the two master scripts to
disambiguate Senseval-2 data. Note, there are many other combinations
of lexical and syntactic features that may be used. The commands in
this file are just to depict how one may use these master scripts.
Once the paths are set, the two master scripts may be run from any
directory.
5. Complementarity and Redundancy:
==================================
We provide the perl program `red.pl' which may be used to compare two
disparate sets of classifications of the same test data. Specifically,
the values provide insight into the complementarity and redundancy of
the systems which produce the two classifications.
The program takes as input GOLD which gives the intended sense for
given test instances. It also takes the files SYSTEM1 and SYSTEM2
which were produced by two systems intended to do word sense
disambiguation. SYSTEM1 and SYSTEM2 have the classifications of the
test instances as done by the individual systems. The program
calculates and prints the following:
(a) Total number of instances in GOLD.
(b) Number of instances attempted - system 1.
(c) Number of instances attempted - system 2.
(d) Number of instances correct - system 1.
(e) Number of instances correct - system 2.
For each instance, each sense id suggested by the system is matched
with the sense ids provided for that instance in GOLD. If x is
incremented by 1 for every match and the system suggests y senses in
all, the score for that instance is x/y. This score is summed over
all the instances to determine, effectively, the total number of
instances disambiguated correctly by the system.
(f) Number of instances correctly tagged by either system1 or system2.
A cumulative sum of the maximum scores for each instance as attained
by the two systems determines (f).
(g) Optimal Ensemble. The ratio of the number of instances correctly
tagged by either system1 or system2 to the total number of instances
is the Optimal Ensemble.
(h) Number of instances tagged the same by systems 1 and 2 (similarity
in tagging by dice). The senses assigned to each instance by the
two systems are compared. Let x be the number of senses assigned
by both systems to an instance. Let y and z be the total number
of senses assigned to it by the two systems individually. The
similarity of tagging for the instance is then determined by the
formula 2*x/(y+z). This similarity, summed over all the instances,
yields, effectively, the number of instances tagged alike by the
two systems. A percentage value based on the total number of
instances is also provided.
(i) Agreement
Number of instances that are assigned at least one sense in common by
Systems 1 and 2. A percentage value based on the total number of instances is also
provided.
(j) The number of instances satisfying (i) that are also correct.
For each instance that satisfies the condition in (i), a correctness
score is calculated. Let y be incremented every time a sense, say s,
provided by system 1 is also provided by system 2. Let x be incremented
only when the previous condition is met and s is provided in GOLD for
that instance. Then the correctness score is defined as x/y. A
cumulative sum of the correctness score for all the instances that
satisfy the condition specified in (i) produces (j).
(k) Precision on Agreement.
If (j) evaluates to x and (i) evaluates to y, Precision on Agreement
is defined to be x/y.
(l) Conservative ensemble. The ratio of instances correctly tagged by
both systems to the total number of instances is defined to be the
Conservative Ensemble.
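The following fragment is an illustrative Perl sketch (not part of
the package) of the per-instance quantities described above, assuming
each system's answer for an instance is a list of sense ids; the
sense ids below are hypothetical.
   #!/usr/bin/perl
   # Illustrative sketch only: per-instance quantities used by red.pl,
   # for one hypothetical test instance.
   use strict;
   my @gold    = ("canteen%flask");                      # senses in GOLD
   my @system1 = ("canteen%flask", "canteen%cafeteria"); # senses by system 1
   my @system2 = ("canteen%flask");                      # senses by system 2
   # Score of a system on this instance: matches with GOLD (x) divided
   # by the number of senses the system suggested (y).
   sub score {
       my ($gold, $sys) = @_;
       my %gold = map { $_ => 1 } @$gold;
       my $x = grep { $gold{$_} } @$sys;
       return @$sys ? $x / @$sys : 0;
   }
   # Similarity of tagging by dice: 2*x/(y+z), where x is the number of
   # senses assigned by both systems and y, z are the number of senses
   # assigned by each system individually.
   sub dice {
       my ($s1, $s2) = @_;
       my %s1 = map { $_ => 1 } @$s1;
       my $x = grep { $s1{$_} } @$s2;
       return 2 * $x / (@$s1 + @$s2);
   }
   printf "score system 1 : %.2f\n", score(\@gold, \@system1);   # 0.50
   printf "score system 2 : %.2f\n", score(\@gold, \@system2);   # 1.00
   printf "dice similarity: %.2f\n", dice(\@system1, \@system2); # 0.67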
The program has the following usage.
USAGE : red.pl [OPTIONS] GOLD SYSTEM1 SYSTEM2
OPTIONS:
--version Prints the version number.
--help Prints help message.
6. wsd-setup DETAILS:
=====================
Apart from creating the SETUP directory and copying relevant files,
wsd-setup calls the following programs to pre-process the sense-tagged
data:
6.1 percent2tilda.pl: This program replaces the percent symbol in the
answer ID tag of a Senseval-2 data format training file SOURCE, with
`~'. The output is printed on stdout and may be captured by
redirection.
USAGE : percent2tilda.pl [OPTIONS] SOURCE
OPTIONS:
--version Prints the version number.
--help Prints help message.
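For example, assuming the training file is art.n-training.xml (file
names here are purely illustrative), the modified file may be
captured by redirection as follows:
   percent2tilda.pl art.n-training.xml > art.n-training.tilda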
6.2 removePtag.pl: This program takes a Senseval-2 format training
file as input (SOURCE). It eliminates those xml tags which suggest
that an instance is a proper noun (the sense ID `P' stands for proper
noun and is not another sense of the target word).
USAGE : removePtag.pl [OPTIONS] DESTINATION SOURCE
OPTIONS:
--version Prints the version number.
--help Prints help message.
7. duluthCombo DETAILS:
=======================
duluthCombo calls upon various programs and support scripts which in
turn call other scripts and programs. Following is a description of the
scripts and programs:
7.1 Support Scripts:
--------------------
The scripts discussed in 7.1.1 through 7.1.5 have a common usage
described below:
Usage: fNUM TASK STOPLIST TOKEN NONTOKEN MINCOUNT PARSEFEATURES
[POSFEATURES]
NUM : NUM is an element of the set {00, 4, 5, 6, 7},
corresponding to the scripts: f00, f4, f5, f6 and f7.
TASK : The lexical element for which the decision tree learning
is to be done and whose test instances are to be disambiguated.
STOPLIST : text file of stop words. duluthCombo calls these
support scripts with the STOPLIST with which it was run.
TOKEN : text file of token definitions.
NONTOKEN : text file of non-token definitions.
MINCOUNT : the minimum number of times the features must exist in
the training data for them to be selected.
PARSEFEATURES : specifies the parse features which are to be used
to create the parse features decision tree.
POSFEATURES : specifies the part of speech features which are to
be used in the decision tree.
Note: Except TASK, all the other arguments have the same
significance as in duluthCombo.
7.1.1 f00: f00 sets the cut-off score of the statistic used by
statistic.pl to identify suitable bigram features. This script is a
minor variation of the f0 script in DuluthShell. The `if' statement is
changed to accommodate additional command line arguments. Also,
gram-bi is called instead of gram2 so that count.pl may use the
nontoken file as well.
7.1.2 f4: f4 is similar to f00 except that it calls gram-uni which
identifies suitable unigram features.
7.1.3 f5: f5 is used to capture surface form features of the target
word. It calls learnSurf.pl to identify the various surface forms of
the target word and surf2regex.pl which creates appropriate regexes
corresponding to these surface form features.
7.1.4 f6: f6 may be used to capture part of speech features. It calls
learnPOS.pl which identifies the specified part of speech features
from the training data and pos2regex.pl which creates appropriate
regexes corresponding to these part of speech features.
7.1.5 f7: f7 may be used to capture parse features. It calls
choose.pl which calls different programs depending on which parse
features are to be used. It then calls parse2regex.pl which creates
appropriate regexes corresponding to these parse features.
7.1.6 gram-bi: This script calls count.pl and statistic.pl to identify
bigram features from the training data. It then calls nsp2regex.pl to
create appropriate regexes corresponding to these bigram features.
Note, this script is a minor variation of gram2 in DuluthShell in that
it also calls joinheads.pl to remove white space (if present) between
the <head> and </head> tags.
7.1.7 gram-uni: This script is similar to gram-bi, except that
count.pl is used to identify unigram features from the training data.
7.1.8 tasksbypos: This script may be used to find disambiguation
scores per part of speech, that is, the accuracy in disambiguating all
instances which have noun target words, the accuracy in disambiguating
all instances which have verb target words, and so on. The script
finds scores for nouns, verbs, adjectives and indeterminates (lexical
elements with instances having multiple parts of speech or whose part
of speech is not known). It has the following usage:
Usage: tasksbypos LexSample SCOREDIR
LexSample is the directory which has the sense-tagged data. SCOREDIR
is the directory which has the final score file (suffixed by .scores)
created by duluthCombo.
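For example, assuming duluthCombo was run with output directory
OUTPUT and the sense-tagged data resides in SETUP/LexSample (names
illustrative), the per part of speech scores may be obtained with:
   tasksbypos SETUP/LexSample OUTPUT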
In order to determine the overall noun score the script calls
splitpos.pl with all noun lexical elements as command line
arguments. The accuracy as calculated by splitpos.pl is then placed in
the file Nouns.acc which is created in the OUTPUT directory (specified
while running duluthCombo). Verb, Adjective and indeterminate scores
are determined in a similar fashion and their accuracies placed in
Verbs.acc, Adjectives.acc and Indeterminates.acc. Note, these
accuracy files are created only if there are instances corresponding
to the particular part of speech in the test data.
7.2 Programs:
-------------
7.2.1 learnSurf.pl: This program identifies the various surface forms
of the target word in the given SOURCE files. An entry is made in the
OUTPUT file for each such unique surface form. There is one entry per
line. An entry consists of "TGT", a space and the unique surface form
of the target word. "TGT" stands for target word.
USAGE : learnSurf.pl [OPTIONS] OUTPUT SOURCE
OPTIONS:
--mincount MIN the feature must occur at least MIN times to
qualify. Default value is 0.
--help Prints the help message.
--version Prints the version number.
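For example, if the target word is `art' (a hypothetical case),
OUTPUT may contain entries such as:
   TGT art
   TGT arts
   TGT Art
Each entry records one surface form in which the target word appears
in SOURCE.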
7.2.2 learnPOS.pl: This program identifies POS features from a given
training file, SOURCE. The user specifies the positions of the
features he is interested in, relative to the target word. L1 stands
for the pos feature one position to the left of the target word. Ln
stands for the pos feature n positions to the left. Similarly, R1, R2,
... , Rn. The program captures the pos occurring at the positions
specified and the number of times they occur at that position. Those
parts of speech which occur more than a user specified number of times
(MIN), along with the corresponding position constitute a learned
feature. This feature, position and pos, are then printed out to the
OUTPUT file. Each entry in OUTPUT corresponds to one feature. The
position and part of speech are space separated. A POSITION may be a
combination of two individual positions, say R1 and R2. Then the
program learns features such as R1 is a determiner AND R2 is a noun.
Such combination POSITIONS are to be specified as follows: A:R1-R2.
Here, A: stands for a combination feature (And). The positions
combined are hyphen separated.
USAGE : learnPOS.pl [OPTIONS] OUTPUT SOURCE POSITION [POSITION]...
OPTIONS:
--mincount MIN the feature must occur at least MIN times to
qualify. Default value is 0.
--help Prints the help message.
--version Prints the version number.
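As an illustration, the following hypothetical run learns part of
speech features at positions L1 and R1 and the combination position
A:R1-R2, keeping only features seen at least twice (file names are
illustrative):
   learnPOS.pl --mincount 2 art.pos art.n-training.xml L1 R1 A:R1-R2
Assuming Penn Treebank style tags in the tagged text, OUTPUT entries
may then look like `L1 DT' or `R1 NN'.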
7.2.3 choose.pl: This program calls the appropriate parse feature
learning programs as specified by PARSEFEATURES which are hyphen
separated digits which map to the following parse features:
1 head word of phrase housing the target word : learnParse1.pl
2 phrase of target word                       : learnParse2.pl
3 head of parent phrase                       : learnParse3.pl
4 parent phrase of target word                : learnParse4.pl
0 no parse feature                            : No program called
Thus, if we wanted to use just the head word and the head of the
parent phrase, PARSEFEATURES would be 1-3. Note, the programs corresponding to
identifying the different kinds of parse features are also listed
above.
USAGE : choose.pl [OPTIONS] PARSEFEATURES
OPTIONS:
--word target word. Mandatory.
--mincount any parse feature must occur at least this minimum
number of times in the training data. Defaults to 0.
--version Prints the version number.
--help Prints help message.
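For instance, to learn the head word and head of parent phrase
features for the hypothetical target word `art', requiring each
feature to occur at least twice, one might call:
   choose.pl --word art --mincount 2 1-3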
7.2.4 learnParse1.pl, learnParse2.pl, learnParse3.pl and
learnParse4.pl: As stated above, these programs identify different
kinds of parse features (also stated above) from the training data
SOURCE file. Entries are made in the OUTPUT file for each of the
learned features. There is one entry per feature. Each entry is made
of a label, a space and the particular feature. The labels
corresponding to the different programs are listed below:
Program              Label                     Example Entry
-------              -----                     -------------
learnParse1.pl       HD (for head word)        HD art
learnParse2.pl       HDpos (pos of Head)       HDpos N
learnParse3.pl       PHD (Parent of Head)      PHD school
learnParse4.pl       PHDpos (pos of Parent)    PHDpos N
All these programs have a common usage:
USAGE : learnParse.pl [OPTIONS] OUTPUT SOURCE
OPTIONS:
--mincount MIN the feature must occur at least MIN times to
be selected. Default 0.
--help Prints the help message.
--version Prints the version number
7.2.5 surf2regex.pl: This program creates regexes corresponding to the
surface form features listed in the SOURCE file. Each line in the
SOURCE file corresponds to one feature. Each entry consists of `TGT',
a space and a surface form (type); TGT stands for the target word.
The regex created, if matched with any line, suggests that the line
has the feature, i.e. the line has the target word in the particular
surface form.
USAGE : surf2regex.pl [OPTIONS] OUTPUT SOURCE
OPTIONS:
--help Prints the help message
--version Prints the version number
7.2.6 pos2regex.pl: This program creates regexes corresponding to the
features listed in the SOURCE file. Each line in the SOURCE file
corresponds to one feature. It has the position, a space and the
part of speech. The regex created, if matched with any line, suggests
that the line has the feature, i.e. the line has the part of speech
at the position.
USAGE : pos2regex.pl [OPTIONS] OUTPUT SOURCE
OPTIONS:
--help Prints the help message
--version Prints the version number
7.2.7 parse2regex.pl: This program creates regexes corresponding to
the parse features listed in the SOURCE file. Each line in the SOURCE
file corresponds to one feature. Each feature entry consists of the
broad category, such as head word, parent phrase, etc., of the target
word, and a specific word or phrase which is the head word or parent
phrase and so on. The regex created, if matched with any line,
suggests that the line has the feature.
USAGE : parse2regex.pl [OPTIONS] OUTPUT SOURCE
OPTIONS:
--help Prints the help message
--version Prints the version number
7.2.8 joinheads.pl: This program removes the whitespace between the
<head> tag and the target word AND between the target word and the
</head> tag.
USAGE : joinheads.pl [OPTIONS] DESTINATION SOURCE
--version Prints the version number
--help Prints help message
7.2.9 splitpos.pl: This program prints out the precision value
averaged over the tasks specified as command line arguments. It picks
up the results for each individual lexelt from the .scores file
SCORES. Note: by task we mean the instance ID of the test instance.
USAGE : splitpos.pl [OPTIONS] SCORES TASK1 [TASK2 ...] > OUTPUT
--version Prints the version number
--help Prints help message
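For example, assuming the final score file is OUTPUT/experiment.scores
and art.n and authority.n are the noun lexical elements (names
illustrative), the averaged noun precision may be obtained with:
   splitpos.pl OUTPUT/experiment.scores art.n authority.n > Nouns.acc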
7.2.10 red.pl: Discussed in Section 5.
8. IMPORTANT OUTPUT FILES:
==========================
Here's a brief discussion of some of the significant files created by
duluthCombo.
The .scores file has the result of comparing the manually assigned
sense and the automatic sense assigned by the system, for all the test
instances. The overall precision and recall are stated at the end of
the file.
The .dist file has the probability assigned to each of the senses of
the target word for a given instance.
The .answers file has a list of senses assigned by the system to each
of the test instances. The assigned sense is the one with the highest
probability. If two or more of the top senses are assigned
probabilities within a certain threshold, all such senses are
assigned.
Nouns.acc, Verbs.acc, Adjectives.acc and Indeterminates.acc (if
created) have overall accuracies for the respective syntactic
categories.
The .scores files are also created within the directory for each
lexical element. The precision of disambiguation for instances
belonging to a lexelt is stated at the end of this file. Note: Ignore
recall values in such individual lexelt .scores files, as they
represent the ratio of the number of instances of a particular lexelt
which are correctly disambiguated to the total number of instances in all
lexelts (instead of the particular lexelt). These .scores files are
created just to give the precision on individual lexelts.
9. Copying:
===========
This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.