README
------

SyntaLex Version 0.1
Copyright (C) 2001-2002

Mohammad Saif, smm@cs.toronto.edu
University of Toronto

Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

##################### LAST UPDATED: Feb, 2004 #######################

SyntaLex is a Perl package to do word sense disambiguation of data in
Senseval-2 data format, using various lexical and syntactic features.
We also provide a Perl program which calculates various measures that
may be used to determine the redundancy and complementarity of two
sets of classifications.

######################################################################

1. INTRODUCTION:
================

Word sense disambiguation is the process of identifying the correct
sense, or meaning, of a word based on its context. Context may be
defined as the sentence housing the target word and possibly one or
more sentences surrounding it. The phenomenon of a word having
multiple meanings, or senses, is known as polysemy and such words are
said to be polysemous. For example, consider the noun `canteen'. It
has two senses: `a flask to carry liquids' and `a cafeteria'. It is
thus polysemous. Usually the intended sense of a polysemous word may
be identified based on its context. For example:

     The prisoner was abandoned in the desert with just one
     canteen of water.

In the above sentence, `canteen' has clearly been used in the `flask
to carry liquids' sense. While this is obvious to humans, creating an
automated system which identifies the intended sense of a word based
on its context remains a challenge.

Numerous lexical and syntactic clues, or features, of the context may
be used for automated word sense disambiguation. Lexical features,
such as the surface form of a word, unigrams and bigrams, may be
readily captured from raw text. Features which rely on a more enriched
corpus, such as part of speech tagged and/or parsed text, are known as
syntactic features.

SyntaLex is a Perl implementation to do word sense disambiguation
using individual lexical and syntactic features or combinations of
them. It uses the Waikato Environment for Knowledge Analysis' (WEKA's)
C4.5 decision tree learning algorithm to learn decision trees. Only
slight variations in the scripts will allow the use of other learning
algorithms. The package uses the N-gram Statistics Package (NSP) to
capture unigram and bigram features. The package is an extension of
the Duluth Shell system and utilizes certain Perl programs from it and
from the SenseTools package.

2. LOCATION OF FILES:
=====================

Unpacking this package creates the SyntaLex directory within which all
files and directories exist. All shell scripts are in the `Scripts'
directory while all Perl programs are in the `Perl' directory. Thus,
if $HOME represents the directory you have unpacked the package in,
all shell scripts will be in $HOME/SyntaLex/Scripts and all the Perl
programs in $HOME/SyntaLex/Perl.

$HOME/SyntaLex/DataSensitive has certain files which may be used by
the master scripts. These files are sensitive to the data being used.
Depending on which data is being used, appropriate files are to be
copied from $HOME/SyntaLex/DataSensitive to $HOME/SyntaLex. Modified
versions of these files or completely different files may be used. The
files in this directory are provided for the user's convenience.

`duluthCombo' utilizes the following text files: fine.key, stop.list,
token.txt, nontoken.txt and sensemap. All these files must be present
in $HOME/SyntaLex. `fine.key' and `sensemap' depend on the data being
used.
Files corresponding to Senseval-2, Senseval-1, "line", "hard", "serve"
and "interest" are provided in the $HOME/SyntaLex/DataSensitive
directory. They are named as follows:

     Sense-Tagged Data    fine.key              sensemap
     -----------------    --------              --------
     Senseval-2 data      Senseval2-fine.key    Senseval2-sensemap
     Senseval-1 data      Senseval1-fine.key    dummy-sensemap
     "line" data          line-fine.key         dummy-sensemap
     "hard" data          hard-fine.key         dummy-sensemap
     "serve" data         serve-fine.key        dummy-sensemap
     "interest" data      interest-fine.key     dummy-sensemap

As a default, $HOME/SyntaLex has files corresponding to Senseval-2
data. Move in the appropriate files from $HOME/SyntaLex/DataSensitive
for experiments with other data.

$HOME/SyntaLex/LexSample must have the test and training data. The
files corresponding to each lexical element must be placed in a
directory by the name of the lexical element. These directory names
must be suffixed by `.s' (as in Senseval-2 data) or `-s' (as in
Senseval-1 data), where `s' stands for the syntactic category
corresponding to the lexical element. `s' may take any of the
following forms: `n' for nouns, `v' for verbs, `a' for adjectives and
`p' for indeterminates (lexical elements whose data instances belong
to more than one or an unknown part of speech). The directory for each
lexical element must have the test and training xml and count files.

To exemplify the above requirements, $HOME/SyntaLex/LexSample may have
directories named `art.n', `authority.n', `begin.v' and `blind.a'.
`art.n' and `authority.n' correspond to noun lexical elements,
`begin.v' corresponds to verb instances of `begin' and `blind.a'
corresponds to adjective instances of `blind'. `art.n' itself must
have the following files: art.n-test.count, art.n-test.xml,
art.n-training.count and art.n-training.xml. Similar files must exist
in all other lexelt directories.
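This layout is easy to verify mechanically. The following short Perl
sketch (a hypothetical helper, not part of the package) checks that
every lexelt directory in LexSample contains the four required files:

     #!/usr/bin/perl -w
     # Hypothetical helper: verify that each lexelt directory in
     # LexSample has the test/training xml and count files that
     # duluthCombo expects.
     use strict;

     my $lexsample = shift || "LexSample";
     opendir(DIR, $lexsample) or die "Cannot open $lexsample: $!";
     foreach my $lexelt (grep { /[.-][nvap]$/ && -d "$lexsample/$_" }
                         readdir(DIR)) {
         foreach my $suffix ("test.count", "test.xml",
                             "training.count", "training.xml") {
             my $file = "$lexsample/$lexelt/$lexelt-$suffix";
             print "MISSING: $file\n" unless -e $file;
         }
     }
     closedir(DIR);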
3. NECESSARY SOFTWARE:
======================

The following packages must be downloaded and unpacked before running
the master scripts.

a. SyntaLex:

Download and unpack the SyntaLex package. It is available at:

     http://www.cs.toronto.edu/~smm/research.html

Let the directory where SyntaLex is unpacked be $HOME. The package
unpacks into the directory SyntaLex within which all other files are
created. Thus, all the SyntaLex files are within $HOME/SyntaLex. The
following paths and environment variable are to be set:

     setenv SyntaLexHOME $HOME/SyntaLex
     set path = ($SyntaLexHOME $path)
     set path = ($SyntaLexHOME/Scripts $path)
     set path = ($SyntaLexHOME/Perl $path)

b. SenseTools:

The SenseTools package is to be downloaded from:

     http://www.d.umn.edu/~tpederse/sensetools.html

We used version 0.3. Any later version should be compatible as well.
Let the directory where SenseTools is unpacked be $HOME. A directory
named as per the version of SenseTools is created here and all other
files are created within it. Thus, I have all SenseTools files in
$HOME/SenseTools-0.3. The following path and environment variable are
to be set:

     setenv SENSEHOME $HOME/SenseTools-0.3
     set path = ($SENSEHOME $path)

c. WEKA:

A stable version of WEKA is to be installed. We have used weka-3-2-3
(GUI version) and any later version should be compatible as well. It
is available at:

     http://www.cs.waikato.ac.nz/ml/weka/

The files may be unpacked by the following commands:

     jar -xvf weka-3-2-3.jar
     jar -xvf weka.jar
     jar -xvf weka-src.jar

Note: weka.jar and weka-src.jar are created within the weka-3-2-3
directory, which is created by the first jar command.

Let the directory where WEKA is unpacked be $HOME. Then a directory
named as per the version of WEKA used is created here and all the WEKA
files and subdirectories are created within it. Thus, I have all the
files in $HOME/weka-3-2-3. The following paths and environment
variables must be set:

     setenv WEKAHOME $HOME/weka-3-2-3
     set path = ($WEKAHOME $path)
     setenv CLASSPATH $WEKAHOME/weka.jar:$SENSEHOME

d. DuluthShell:

Install DuluthShell from:

     http://www.d.umn.edu/~tpederse/senseval2.html

Like the packages above, unpacking DuluthShell creates a directory
named as per its version. I have used version 0.3. Thus, all
DuluthShell files will be within $HOME/DuluthShell-v0.3. Set paths as
follows:

     set path = ($HOME/DuluthShell-v0.3/Scripts $path)
     set path = ($HOME/DuluthShell-v0.3/Perl $path)

e. NSP:

The N-gram Statistics Package is available at:

     http://www.d.umn.edu/~tpederse/nsp.html

We have used version 0.59. Newer versions should be compatible as
well. Let the directory where NSP is unpacked be $HOME. Then a
directory named as per the version of NSP used is created here and all
the NSP files and subdirectories are created within it. Thus, I have
all the NSP files in $HOME/nsp-v0.59. The following paths and
environment variables must be set:

     setenv NSPHOME $HOME/nsp-v0.59
     set path = ($NSPHOME $path)
     set path = ($NSPHOME/Measures $path)
     set path = ($NSPHOME/Utils $path)

f. Perl, version 5.6.1 or above, and Python, version 2.2 or above,
must be installed and their paths set. Note, you may have to change
the first line of scorer.python (a program in DuluthShell) to reflect
the correct path of Python. For example, if Python is installed at
/usr/bin/python2.2 then the first line of scorer.python must be:

     #!/usr/bin/python2.2

All the Perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, requisite changes will need to be made to the
first line of the Perl programs. An alternative to changing the code
for this purpose is to alias `/usr/bin/perl' to the appropriate Perl
directory. For example, if the Perl software is located in
`/usr/local/bin/perl', the following command aliases `/usr/bin/perl'
to `/usr/local/bin/perl':

     alias /usr/bin/perl /usr/local/bin/perl
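Since several environment variables are involved, it can be handy to
check them before a run. The following Perl one-off (a hypothetical
convenience, not part of any of the packages above) reports which of
the variables used in this section are set:

     #!/usr/bin/perl -w
     # Hypothetical sanity check: report whether the environment
     # variables relied upon by the master scripts are set.
     use strict;

     foreach my $var ("SyntaLexHOME", "SENSEHOME", "WEKAHOME",
                      "NSPHOME", "CLASSPATH") {
         if (exists $ENV{$var}) { print "$var = $ENV{$var}\n"; }
         else                   { print "$var is NOT set\n";   }
     }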
4. SCRIPTS:
===========

Two scripts, wsd-setup and duluthCombo, may be used to take you
through the complete process of pre-processing, capturing features,
learning decision trees, running them on the test data and evaluating
the sense tagging. Following are their usage details:

a. wsd-setup:

This script pre-processes the sense-tagged data in Senseval-2 data
format to make it suitable for word sense disambiguation using WEKA.
Its functions are as follows:

1. Create a new directory, say SETUP, which has the sense-tagged data
   and files needed by duluthCombo such as fine.key, sensemap, token
   and nontoken files.

2. Replace the percent symbol in the answer ID tag of a Senseval-2
   data format training file with `~'.

3. Eliminate xml tags which suggest that the instance is a proper
   noun. For example: P stands for proper noun and is not another
   sense of the target word. This is to avoid considering P as a sense
   of the instance.

Usage: wsd-setup SETUP

SETUP : The new directory created by wsd-setup which holds all files
needed by duluthCombo and the sense-tagged data. All these files are
copied from $SyntaLexHOME. Only the sense-tagged data is pre-processed
as above; the rest remain unchanged. The files copied are as follows:

a. fine.key : The key file which has the manually tagged senses for
   the test data.

b. sensemap : This file is used for finer grained distinctions among
   the senses. Further details of the key file and sensemap file may
   be found at:

        http://www.sle.sharp.co.uk/senseval2/Scoring/

c. token.txt : This file is used by the NSP package to identify the
   valid tokens in the data. Used in detecting bigram and unigram
   features.

d. nontoken.txt : This file is used by the NSP package to identify
   invalid tokens and remove them from consideration. Used in
   detecting bigram and unigram features.

e. stop.list : This file is used by NSP to take care of non-content
   words while identifying bigrams and unigrams.

f. LexSample : This directory must have the sense-tagged data to be
   used for word sense disambiguation. It must have the count and xml
   files of the test and training data of each lexical element, within
   a directory named as per the lexical element and suffixed with the
   appropriate part of speech corresponding to the instances, as
   described in section 2.

NOTE: $SyntaLexHOME by default has files for Senseval-2 data and the
LexSample directory is kept empty. To run experiments on Senseval-2
data, all that needs to be done is to copy the data into LexSample and
run the master scripts. To run experiments on any other data, copy the
files mentioned above from the DataSensitive directory to
$SyntaLexHOME. Here is what I would do to disambiguate `line' data:

     cp DataSensitive/line-fine.key ./fine.key
     cp DataSensitive/dummy-sensemap ./sensemap
     wsd-setup SETUP

b. duluthCombo:

Once we have the pre-processed sense-tagged data and relevant files as
created by wsd-setup, duluthCombo may be used to perform word sense
disambiguation using WEKA and individual lexical and syntactic
features or combinations of them.

Usage: duluthCombo FEATURES SETUP STOPLIST TOKEN NONTOKEN MINCOUNT OUTPUT PARSEFEATURES [POSFEATURES]

FEATURES : set of features which are to be used for WSD. Decision
trees are created corresponding to each feature set, separated from
the others by a hyphen (-). Following are some of the feature sets
that may be specified:

     f00    Bigrams
     f4     Unigrams
     f5     Surface Form
     f6     Part of Speech
     f7     Parse

SETUP : directory created by wsd-setup which has all the relevant
files and data.

STOPLIST : text file of stop words. This file must be in SETUP.
stop.list is the file created by wsd-setup.

TOKEN : text file of token definitions. This file must be in SETUP.
token.txt is the file created by wsd-setup.

NONTOKEN : text file of non-token definitions. This file must be in
SETUP. nontoken.txt is the file created by wsd-setup.

MINCOUNT : the minimum number of times the features must exist in the
training data for them to be selected.

OUTPUT : output directory to which all output files will be moved.
This directory is created by the script and must not already exist.

PARSEFEATURES : specifies the parse features which are to be used to
create the parse features decision tree. It is specified as
hyphen-separated digits, each digit selecting one of the parse
features below; a PARSEFEATURES of 0 means that no parse features will
be used as part of the decision tree. The digits correspond to the
following parse features:

     1    head word
     2    phrase
     3    head of parent phrase
     4    parent phrase
     0    No Parse Features

POSFEATURES : specifies the positions of the part of speech features
which are to be used. Individual positions are specified as L1, L2, L3
and so on, or R1, R2, R3 and so on. L1 corresponds to the part of
speech of words one position to the left of the target word, L2 the
part of speech of words two positions to the left of the target word,
and so on. R1 stands for the part of speech of the target word itself
(the first pos tag to the right of the head word). R2 corresponds to
the pos of words one position to the right of the target word (the
second pos tag to the right of the head word), R3 is the pos of words
two positions to the right of the target word (the third pos tag to
the right of the head word), and so on.

A position may be a combination of two individual positions, say R1
and R2. The program then learns features such as `R1 is a determiner
AND R2 is a noun'. Such combination positions are to be specified as
follows:

     A:R1-R2

Here, `A:' stands for a combination feature (And). The positions
combined are hyphen separated.
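To make the argument list concrete, here is an example invocation (the
parameter values are hypothetical, chosen only for illustration). It
learns one bigram decision tree and one part of speech decision tree,
requires each feature to occur at least 2 times in the training data,
uses no parse features, and uses the part of speech one position to
the left of the target word:

     duluthCombo f00-f6 SETUP stop.list token.txt nontoken.txt 2 OUTPUT 0 L1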
NOTE: Since all output files are moved to OUTPUT, you may run
wsd-setup just once but run duluthCombo any number of times with
varying parameters.

NOTE: stop.list, token.txt and nontoken.txt are the default files
created by wsd-setup, which may or may not be used by duluthCombo. If
you wish to use other files, place the files in SETUP.

`sample-runs' shows how I have used the two master scripts to
disambiguate Senseval-2 data. Note, there are many other combinations
of lexical and syntactic features that may be used. The commands in
this file are just to depict how one may use these master scripts.
Once the paths are set, the two master scripts may be run from any
directory.
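In that spirit, a condensed session might look as follows (directory
names and parameter values are hypothetical). wsd-setup is run once,
and duluthCombo is then run twice over the same SETUP directory with
different feature sets, each run writing to its own output directory:

     wsd-setup SETUP
     duluthCombo f00 SETUP stop.list token.txt nontoken.txt 2 OUT-bigram 0
     duluthCombo f7 SETUP stop.list token.txt nontoken.txt 2 OUT-parse 1-3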
5. Complementarity and Redundancy:
==================================

We provide the Perl program `red.pl' which may be used to compare two
disparate sets of classifications of the same test data. Specifically,
the values provide insight into the complementarity and redundancy of
the systems which produce the two classifications.

The program takes as input GOLD, which gives the intended sense for
the given test instances. It also takes the files SYSTEM1 and SYSTEM2,
which were produced by two systems intended to do word sense
disambiguation. SYSTEM1 and SYSTEM2 have the classifications of the
test instances as done by the individual systems. The program
calculates and prints the following:

(a) Total number of instances in GOLD.

(b) Number of instances attempted - system 1.

(c) Number of instances attempted - system 2.

(d) Number of instances correct - system 1.

(e) Number of instances correct - system 2.

    For each instance, each sense id suggested by the system is
    matched with the sense ids provided for that instance in GOLD. If
    for every match we increment x by 1 and the system suggests y
    senses in all, the score for that instance is x/y. This score is
    added up for all the instances to determine (effectively) the
    total number of instances disambiguated correctly by the system.

(f) Number of instances correctly tagged by either system 1 or
    system 2.

    A cumulative sum of the maximum scores for each instance as
    attained by the two systems determines (f).

(g) Optimal Ensemble.

    The ratio of the number of instances correctly tagged by either
    system 1 or system 2 to the total number of instances is the
    Optimal Ensemble.

(h) Number of instances tagged the same by systems 1 and 2 (similarity
    in tagging by dice).

    The senses assigned to each instance by the two systems are
    compared. Let x be the number of senses assigned by both systems
    to an instance. Let y and z be the total number of senses assigned
    to it by the two systems individually. The similarity of tagging
    for the instance is then determined by the formula:

         2*x/(y+z)

    This similarity, summed over all the instances, yields effectively
    the number of instances tagged alike by the two systems. A
    percentage value based on the total number of instances is also
    provided.

(i) Agreement.

    Number of instances that are assigned at least one sense in common
    by systems 1 and 2. A percentage value based on the total number
    of instances is also provided.

(j) A subset of the instances which satisfy (i) that are correct.

    For each instance that satisfies the condition in (i), a
    correctness score is calculated. Let y be incremented every time a
    sense, say s, provided by system 1 is also provided by system 2.
    Let x be incremented only when the previous condition is met and s
    is provided in GOLD for that instance. Then the correctness score
    is defined as x/y. A cumulative sum of the correctness scores for
    all the instances that satisfy the condition specified in (i)
    produces (j).

(k) Precision on Agreement.

    If (j) evaluates to x and (i) evaluates to y, Precision on
    Agreement is defined to be x/y.

(l) Conservative Ensemble.

    The ratio of instances correctly tagged by both systems to the
    total number of instances is defined to be the Conservative
    Ensemble.

The program has the following usage:

USAGE : red.pl [OPTIONS] GOLD SYSTEM1 SYSTEM2

OPTIONS:

--version
     Prints the version number.

--help
     Prints help message.
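The dice-style measure in (h) is simple to compute. Below is a minimal
Perl sketch of just that calculation (the instance IDs and senses are
made up for illustration; red.pl itself reads them from the SYSTEM
files rather than hard-wiring them):

     #!/usr/bin/perl -w
     # Sketch of measure (h): sum 2*x/(y+z) over instances, where x is
     # the number of senses assigned by both systems, and y and z are
     # the numbers of senses assigned by each system individually.
     use strict;

     my %sys1 = ( "art.1" => ["s1"], "art.2" => ["s1", "s2"] );
     my %sys2 = ( "art.1" => ["s1"], "art.2" => ["s2"] );

     my $similarity = 0;
     foreach my $id (keys %sys1) {
         my %in2 = map { $_ => 1 } @{ $sys2{$id} || [] };
         my $x = grep { $in2{$_} } @{ $sys1{$id} };  # senses in common
         my $y = @{ $sys1{$id} };                    # senses, system 1
         my $z = @{ $sys2{$id} || [] };              # senses, system 2
         $similarity += 2 * $x / ($y + $z) if $y + $z;
     }
     printf "Instances tagged alike (dice): %.2f\n", $similarity;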
6. wsd-setup DETAILS:
=====================

Apart from creating the SETUP directory and copying relevant files,
wsd-setup calls the following programs to pre-process the sense-tagged
data:

6.1 percent2tilda.pl:

This program replaces the percent symbol in the answer ID tag of a
Senseval-2 data format training file, SOURCE, with `~'. The output is
printed on stdout and may be captured by redirection.

USAGE : percent2tilda.pl [OPTIONS] SOURCE

OPTIONS:

--version
     Prints the version number.

--help
     Prints help message.

6.2 removePtag.pl:

This program takes a Senseval-2 format training file as input
(SOURCE). It eliminates those xml tags which suggest that the instance
is a proper noun. For example: P stands for proper noun and is not
another sense of the target word.

USAGE : removePtag.pl [OPTIONS] DESTINATION SOURCE

OPTIONS:

--version
     Prints the version number.

--help
     Prints help message.

7. duluthCombo DETAILS:
=======================

duluthCombo calls upon various programs and support scripts which in
turn call other scripts and programs. Following is a description of
the scripts and programs:

7.1 Support Scripts:
--------------------

The scripts discussed in 7.1.1 through 7.1.7 have a common usage,
described below:

Usage: fNUM TASK STOPLIST TOKEN NONTOKEN MINCOUNT PARSEFEATURES [POSFEATURES]

NUM : NUM is an element of the set {00, 4, 5, 6, 7}, corresponding to
the scripts f00, f4, f5, f6 and f7.

TASK : The lexical element for which the decision tree learning is to
be done and whose test instances are to be disambiguated.

STOPLIST : text file of stop words. duluthCombo calls these support
scripts with the STOPLIST with which it was run.

TOKEN : text file of token definitions.

NONTOKEN : text file of non-token definitions.

MINCOUNT : the minimum number of times the features must exist in the
training data for them to be selected.

PARSEFEATURES : specifies the parse features which are to be used to
create the parse features decision tree.

POSFEATURES : specifies the part of speech features which are to be
used in the decision tree.

Note: Except TASK, all the other arguments have the same significance
as in duluthCombo.

7.1.1 f00:

f00 sets the cut-off score of the statistic used by statistic.pl to
identify suitable bigram features. This script is a minor variation of
the f0 script in DuluthShell. The `if' statement is changed to
accommodate additional command line arguments. Also, gram-bi is called
instead of gram2 so that count.pl may use the nontoken file as well.

7.1.2 f4:

f4 is similar to f00 except that it calls gram-uni, which identifies
suitable unigram features.

7.1.3 f5:

f5 is used to capture surface form features of the target word. It
calls learnSurf.pl to identify the various surface forms of the target
word and surf2regex.pl, which creates appropriate regexes
corresponding to these surface form features.

7.1.4 f6:

f6 may be used to capture part of speech features. It calls
learnPOS.pl, which identifies the specified part of speech features
from the training data, and pos2regex.pl, which creates appropriate
regexes corresponding to these part of speech features.

7.1.5 f7:

f7 may be used to capture parse features. It calls choose.pl, which
calls different programs depending on which parse features are to be
used. It then calls parse2regex.pl, which creates appropriate regexes
corresponding to these parse features.

7.1.6 gram-bi:

This script calls count.pl and statistic.pl to identify bigram
features from the training data. It then calls nsp2regex.pl to create
appropriate regexes corresponding to these bigram features. Note, this
script is a minor variation of gram2 in DuluthShell in that it also
calls joinheads.pl to remove white space (if present) between the
<head> and </head> tags and the enclosed target word.

7.1.7 gram-uni:

This script is similar to gram-bi, except that count.pl is used to
identify unigram features from the training data.

7.1.8 tasksbypos:

This script may be used to find disambiguation scores per part of
speech, that is, the accuracy in disambiguating all instances which
have noun target words, the accuracy in disambiguating all instances
which have verb target words, and so on. The script finds scores for
nouns, verbs, adjectives and indeterminates (lexical elements with
instances having multiple parts of speech or whose part of speech is
not known). It has the following usage:

Usage: tasksbypos LexSample SCOREDIR

LexSample is the directory which has the sense-tagged data. SCOREDIR
is the directory which has the final score file (suffixed by .score)
created by duluthCombo.

In order to determine the overall noun score, the script calls
splitpos.pl with all noun lexical elements as command line arguments.
The accuracy as calculated by splitpos.pl is then placed in the file
Nouns.acc, which is created in the OUTPUT directory (specified while
running duluthCombo). Verb, adjective and indeterminate scores are
determined in a similar fashion and their accuracies placed in
Verbs.acc, Adjectives.acc and Indeterminates.acc. Note, these accuracy
files are created only if there are instances corresponding to the
particular part of speech in the test data.

7.2 Programs:
-------------

7.2.1 learnSurf.pl:

This program identifies the various surface forms of the target word
in the given SOURCE files. An entry is made in the OUTPUT file for
each such unique surface form. There is one entry per line. An entry
consists of "TGT", a space and the unique surface form of the target
word. "TGT" stands for target word.

USAGE : learnSurf.pl [OPTIONS] OUTPUT SOURCE

OPTIONS:

--mincount MIN
     The feature must occur at least MIN times to qualify. Default
     value is 0.

--help
     Prints the help message.

--version
     Prints the version number.
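The idea is easy to see in miniature. The following sketch (an
illustration only, not the actual learnSurf.pl code) collects the
distinct surface forms of a target word, assuming the target is marked
with <head> tags as in Senseval-2 data, and prints one "TGT form"
entry per surface form that occurs at least MIN times:

     #!/usr/bin/perl -w
     # Illustration of the learnSurf.pl idea: tally the surface forms
     # of the <head>-marked target word and print those meeting the
     # mincount cutoff. The sample sentences are made up.
     use strict;

     my $MIN = 1;
     my @lines = (
         "She stood in <head>line</head> for tickets.",
         "The <head>lines</head> at the bank were long.",
         "Another <head>line</head> formed outside.",
     );
     my %form;
     foreach (@lines) {
         $form{$1}++ while /<head>\s*([^<]+?)\s*<\/head>/g;
     }
     foreach (sort keys %form) {
         print "TGT $_\n" if $form{$_} >= $MIN;
     }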
7.2.2 learnPOS.pl:

This program identifies pos features from a given training file,
SOURCE. The user specifies the positions of the features he is
interested in, relative to the target word. L1 stands for the pos
feature one position to the left of the target word and Ln stands for
the pos feature n positions to the left. Similarly, R1, R2, ..., Rn.
The program captures the pos occurring at the positions specified and
the number of times they occur at that position. Those parts of speech
which occur more than a user specified number of times (MIN), along
with the corresponding position, constitute a learned feature. This
feature, position and pos, is then printed out to the OUTPUT file.
Each entry in OUTPUT corresponds to one feature. The position and part
of speech are space separated.

A POSITION may be a combination of two individual positions, say R1
and R2. The program then learns features such as `R1 is a determiner
AND R2 is a noun'. Such combination POSITIONS are to be specified as
follows:

     A:R1-R2

Here, `A:' stands for a combination feature (And). The positions
combined are hyphen separated.

USAGE : learnPOS.pl [OPTIONS] OUTPUT SOURCE POSITION [POSITION]...

OPTIONS:

--mincount MIN
     The feature must occur at least MIN times to qualify. Default
     value is 0.

--help
     Prints the help message.

--version
     Prints the version number.

7.2.3 choose.pl:

This program calls the appropriate parse feature learning programs as
specified by PARSEFEATURES, which is made up of hyphen separated
digits that map to the following parse features:

     1    head word of phrase housing the target word : learnParse1.pl
     2    phrase of target word                       : learnParse2.pl
     3    head of parent phrase                       : learnParse3.pl
     4    parent phrase of target word                : learnParse4.pl
     0    no parse feature                            : No program called

Thus, if we wanted to use just the head word and the head of the
parent phrase, PARSEFEATURES would be 1-3. Note, the programs
corresponding to identifying the different kinds of parse features are
also listed above.

USAGE : choose.pl [OPTIONS] PARSEFEATURES

OPTIONS:

--word
     Target word. Mandatory.

--mincount
     Any parse feature must exist more than this minimum number of
     times in the training data. Defaults to 0.

--version
     Prints the version number.

--help
     Prints help message.

7.2.4 learnParse1.pl, learnParse2.pl, learnParse3.pl and
learnParse4.pl:

As stated above, these programs identify different kinds of parse
features (also stated above) from the training data SOURCE file.
Entries are made in the OUTPUT file for each of the learned features.
There is one entry per feature. Each entry is made of a label, a space
and the particular feature. The labels corresponding to the different
programs are listed below:

     Program          Label                    Example Entry
     -------          -----                    -------------
     learnParse1.pl   HD (for head word)       HD art
     learnParse2.pl   HDpos (pos of Head)      HDpos N
     learnParse3.pl   PHD (Parent of Head)     PHD school
     learnParse4.pl   PHDpos (pos of Parent)   PHDpos N

All these programs have a common usage:

USAGE : learnParse.pl [OPTIONS] OUTPUT SOURCE

OPTIONS:

--mincount MIN
     The feature must occur at least MIN times to be selected.
     Default 0.

--help
     Prints the help message.

--version
     Prints the version number.

7.2.5 surf2regex.pl:

This program creates regexes corresponding to the surface form
features listed in the SOURCE file. Each line in the SOURCE file
corresponds to one feature. Each entry consists of `TGT', a space and
a surface form (type). TGT stands for the surface form of the target
word. The regex created, if matched with any line, suggests that the
line has the feature, i.e., the line has the target word in the
particular surface form.

USAGE : surf2regex.pl [OPTIONS] OUTPUT SOURCE

OPTIONS:

--help
     Prints the help message.

--version
     Prints the version number.
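As a rough illustration of what such a regex does (the exact regexes
surf2regex.pl emits may differ in form), a feature entry like `TGT
lines' can be turned into a pattern that fires only when the target
word appears between <head> tags in exactly that surface form:

     #!/usr/bin/perl -w
     # Illustration only: build a regex from a "TGT form" entry and
     # test it against a line of data. surf2regex.pl's actual output
     # format may differ.
     use strict;

     my $line = "He stood in the ticket <head>lines</head> for hours";
     foreach my $form ("line", "lines") {
         my $regex = qr/<head>\Q$form\E<\/head>/;
         print "TGT $form : ",
               ($line =~ $regex ? "matches" : "does not match"), "\n";
     }

Here `TGT lines' matches the line while `TGT line' does not, since the
target occurs only in its plural surface form.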
7.2.6 pos2regex.pl:

This program creates regexes corresponding to the pos features listed
in the SOURCE file. Each line in the SOURCE file corresponds to one
feature. It has the position, a space and the part of speech. The
regex created, if matched with any line, suggests that the line has
the feature, i.e., the line has the part of speech at the position.

USAGE : pos2regex.pl [OPTIONS] OUTPUT SOURCE

OPTIONS:

--help
     Prints the help message.

--version
     Prints the version number.

7.2.7 parse2regex.pl:

This program creates regexes corresponding to the parse features
listed in the SOURCE file. Each line in the SOURCE file corresponds to
one feature. Each feature entry consists of the broad category, such
as head word or parent phrase of the target word, and a specific word
or phrase which is that head word, parent phrase and so on. The regex
created, if matched with any line, suggests that the line has the
feature.

USAGE : parse2regex.pl [OPTIONS] OUTPUT SOURCE

OPTIONS:

--help
     Prints the help message.

--version
     Prints the version number.

7.2.8 joinheads.pl:

This program removes the whitespace between the <head> tag and the
target word AND between the target word and the </head> tag.

USAGE : joinheads.pl [OPTIONS] DESTINATION SOURCE

OPTIONS:

--version
     Prints the version number.

--help
     Prints help message.

7.2.9 splitpos.pl:

This program prints out the precision value averaged over each of the
tasks specified as command line arguments. It picks up the results for
each individual lexelt from the .scores file SCORES. Note: by task we
mean the lexical element whose test instances were disambiguated.

USAGE : splitpos.pl [OPTIONS] SCORES TASK1 [TASK2 ...] > OUTPUT

OPTIONS:

--version
     Prints the version number.

--help
     Prints help message.

7.2.10 red.pl:

Discussed in Section 5.

8. IMPORTANT OUTPUT FILES:
==========================

Here is a brief discussion of some of the significant files created by
duluthCombo.

The .scores file has the result of comparing the manually assigned
sense and the sense assigned automatically by the system, for all the
test instances. The overall precision and recall are stated at the end
of the file.

The .dist file has the probability assigned to each of the senses of
the target word for a given instance.

The .answers file has a list of the senses assigned by the system to
each of the test instances. The sense assigned is the one with the
highest probability; if two or more of the top senses are assigned
probabilities within a certain threshold, all such senses are
assigned.

Nouns.acc, Verbs.acc, Adjectives.acc and Indeterminates.acc (if
created) have the overall accuracies for the respective syntactic
categories.

The .scores files are also created within the directory for each
lexical element. The precision of disambiguation for instances
belonging to a lexelt is stated at the end of this file. Note: ignore
the recall values in such individual lexelt .scores files, as they
represent the ratio of the number of instances of a particular lexelt
which are correctly disambiguated to the total number of instances in
all lexelts (instead of the particular lexelt). These .scores files
are created just to give the precision on individual lexelts.
9. Copying:
===========

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.