README
------

refine Version 0.2

Copyright (C) 2001-2002
Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

##################### LAST UPDATED: Dec, 2003 ###########################

`refine' is a software package that processes data in Senseval-2 data
format to make it better suited as input to various systems. It may be
used to pre-tag the head words and/or restore split sentences and/or
place sentences on new lines.

############################# QUICK RUN #################################
#
# 1. Download and unpack `refine':
#
#        http://www.d.umn.edu/~moha0149/research.html
#
#    Set the environment variable and path as shown:
#
#        setenv REFINEHOME $REFPACKAGE/refine
#        set path = ($REFINEHOME $path)
#
#    ...where $REFPACKAGE represents the directory where the package
#    has been unpacked.
#
# 2. Type:
#        proc DATA 2 1 2 1 1 OUTPUT SOURCE
#
#    DATA specifies the type of data being processed. This
#    information is used by the script to select appropriate
#    pre-tagging and/or split sentence files. It can take any of
#    the following numeric values:
#
#        1  Senseval-1 training data in Senseval-2 data format
#        2  Senseval-1 evaluation data in Senseval-2 data format
#        3  Senseval-2 training data
#        4  Senseval-2 evaluation data
#        5  Any other data in Senseval-2 data format
#
# PLEASE SEE THE DETAILED USAGE (BELOW) TO TAILOR THE SCRIPT TO YOUR
# NEEDS.
#
###########################################################################

################################ DETAILS ##################################

1. INTRODUCTION:
===============

The Senseval-2 exercise, held in July 2001, brought together numerous
word sense disambiguation systems which were trained and tested on one
common data set and thereby one data format - the Senseval-2 data
format. Since then various other systems have been designed to accept
data in this format.
The "line", "hard", "serve" and "interest" data are also available in
the Senseval-2 data format at the authors' web pages. However, the data
has certain irregularities: some sentences are split up by new line
characters, and some pairs of sentences are not separated by new lines.
Various software tools and packages such as parsers and part of speech
taggers expect every sentence to be on a single line of its own.
Additionally, they may improve on their accuracy if the head words in
the data are pre-tagged with the correct part of speech. This package
aims at processing the data to abide by these conditions.

Information about the Senseval-2 data format is available at its web
page:

    http://www.sle.sharp.co.uk/senseval2

The original Senseval-1, Senseval-2 and the fixed Senseval-1 data may be
downloaded from:

    Senseval-2       : http://www.sle.sharp.co.uk/senseval2/Results/guidelines.htm#rawdata
    Senseval-1       : http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html
    Fixed Senseval-1 : http://www.d.umn.edu/~moha0149/research.html

2. SCRIPTS:
==========

A script `proc' is provided with this package which takes you through
the complete process of appropriately processing any data in Senseval-2
data format.

proc: To pre-tag the head words and/or restore split sentences and/or
      place sentences on new lines.

Files with the necessary pre-tagging information have been provided for
the Senseval-1, Senseval-2, line, hard, serve and interest data. Any
other data in Senseval-2 data format may be pre-tagged as well, after
the creation of the corresponding pre-tagging files. Details in
Section 6.1.

Files with the necessary split sentence restoration information have
been provided for the Senseval-1 and Senseval-2 data. Split sentences in
any other data in Senseval-2 data format may be restored after the
creation of the corresponding restoration files. Details in Section 6.2.
If the source data has sentence boundaries marked with explicit
sentence boundary markers, these markers may be eliminated and each
sentence placed on a new line using this script. proc may also be used
to detect multiple sentences within a pair of new line characters and
place each sentence on a new line. Details in Section 6.3.

If contexts of certain instances are to be replaced with user specified
contexts, proc may be used to do so. The user specifies the instances
and contexts in a CONTEXT file. Some contexts in the Senseval-1,
Senseval-2 and "serve" data have very long sentences (more than 120
tokens). Such sentences may be split into two (possibly more) sentences
by manual inspection. The package provides CONTEXT files which replace
contexts having long sentences by contexts in which the long sentences
have been manually split. Details in Section 6.4.

3. LOCATION OF FILES:
====================

On unpacking the package, all files and data created are within the
`refine' directory. This directory is created in the directory where
the package was unpacked. Let REFINEHOME represent the complete path to
`refine'. The scripts and all the perl programs, as indicated by their
`.pl' extension, are located here. Various other files which have
information regarding the following are also provided:

    pre-tagging information         : $REFINEHOME/data/PRETAG/
    split sentence information      : $REFINEHOME/data/SPLIT/
    context replacement information : $REFINEHOME/data/REPLACE/

    Input files may be placed at    : $REFINEHOME/data/user/
    Output may be placed at         : $REFINEHOME/data/output/

These files are described in the sections to follow. The location of
the above stated data files may be changed, but this entails
appropriate changes in the scripts.

All the perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, requisite changes will need to be made to the
first line of the perl programs.
An alternative to changing the code for this purpose is to alias
`/usr/bin/perl' to the appropriate perl location. For example, if the
perl software is located at `/usr/local/bin/perl', the following
command aliases `/usr/bin/perl' to `/usr/local/bin/perl':

    alias /usr/bin/perl /usr/local/bin/perl

4. NECESSARY SOFTWARE:
=====================

The following software must be downloaded and its location placed in
the PATH in order to successfully process the data.

1. refine: http://www.d.umn.edu/~moha0149/research.html

   The complete path of the refine directory, created after unpacking
   the package, needs to be set as an environment variable in your
   .cshrc file. This is how I set it:

       setenv REFINEHOME $REFPACKAGE/refine

   ...where $REFPACKAGE represents the directory where the package has
   been unpacked. The $REFINEHOME directory must be added to the PATH
   as well. This is how I did it:

       set path = ($REFINEHOME $path)

5. REFINING THE DATA:
====================

The processing of data in Senseval-2 data format involves pre-tagging
the head words and concatenation of split sentences using the script
`proc'.

If the part of speech tag usually associated with a morphological form
of a type is known, all instances of that type which are head words can
be pre-tagged with the associated part of speech. By pre-tag we mean
that the word is tagged with a part of speech before the text is given
to the tagger. Thus the tagger may use this tag to influence its
tagging of the surrounding words. Particular instances of the head
words may be pre-tagged to supersede the tag based on morphology. The
necessary files to pre-tag Senseval-1, Senseval-2, "line", "hard",
"serve" and "interest" data are provided with this package. Pre-tagging
files corresponding to other data in Senseval-2 data format may be
created and used as well. Details in Section 6.1.

The part of speech of the head words in Senseval-1 and Senseval-2 data
is known.
Although finer distinctions such as common noun, past participle etc.
are not given, the broader class such as Noun, Verb etc. is specified.
This information or a refinement of the same may be used to pre-tag the
head words. We are also aware that head words in the "line" data are
nouns (NN), "hard" data instances are adjectives (JJ), and "interest"
data instances are nouns (NN). This pre-tagging information is provided
for the above mentioned data in the form of pre-tagging files.

Senseval-1 and Senseval-2 data have certain sentences split up by
newline characters. Such sentences have been manually identified and
the relevant information stored in text files. Split sentence files
corresponding to other data in Senseval-2 data format may be created
and used as well. Details in Section 6.2.

The `proc' script uses these pre-tagging and split sentence files to
pre-tag the data and/or restore split sentences. Following is its
usage:

USAGE : proc DATA TAG RESTORE NEWLINE REPLACE UNICODE OUTPUT SOURCE

DATA specifies the type of data being processed. This information is
used by the script to select appropriate pre-tagging and/or split
sentence files. It can take any of the following numeric values:

    1  Senseval-1 training data in Senseval-2 data format
    2  Senseval-1 evaluation data in Senseval-2 data format
    3  Senseval-2 training data
    4  Senseval-2 evaluation data
    5  Any other data in Senseval-2 data format

TAG is used to specify the level of pre-tagging to be done. It can take
any of the following numeric values:

    1  Pre-tagging based on head word morphology
    2  Pre-tagging based on head word morphology, superseded by
       specific instance pre-tagging
    0  No pre-tagging

Information for morphology based pre-tagging of all head words of
Senseval-1, Senseval-2, line, hard, serve and interest data is provided
with this package. Please see Section 6.1 for information on how to
customize.
Information for specific instance pre-tagging of all capitalized noun
head words of Senseval-1 and Senseval-2 data is provided with this
package. Please see Section 6.1 for information on how to customize.

RESTORE specifies if any split sentences are to be restored. It can
take any of the following numeric values:

    1  Restore split sentences
    0  No split sentence restoration

Information for split sentence restoration of Senseval-1 and Senseval-2
data is provided with this package. Please see Section 6.2 for
information on how to customize.

NEWLINE specifies if sentences are to be placed on new lines.

    1  Eliminate sentence boundary markers (if present) and place each
       sentence on a new line.
    2  In addition to 1 stated above, check the text within every pair
       of new line characters to see if it consists of multiple
       sentences. If yes, place each sentence on a new line.
    0  No action

REPLACE specifies if contexts of certain instances are to be replaced
with modified/new contexts. The instances whose contexts are to be
replaced and the modified/new contexts are specified via a CONTEXT
file.

    1  Replace contexts
    0  No action

UNICODE specifies if special unicode characters are to be replaced by
appropriate xml representations as listed in the file unicodemap.txt.

    1  Replace unicode characters
    0  No action

OUTPUT is the name of the directory where all the output files are to
be created. The directory specified will be created by the script and
must not exist already.

SOURCE is the name of the Senseval-2 data format file to be processed.
A complete or relative path to the file may be specified.

The following files are placed in the OUTPUT directory:

    1.txt         : A copy of the SOURCE file.
    pretag.xml    : The pre-tagged SOURCE file. Created only if
                    pre-tagging is requested.
    join.xml      : Split sentence restored SOURCE file. Created only
                    if split sentence restoration is requested.
    marked.xml    : The file created on using 1 as the NEWLINE option
    multisent.xml : The file created on using 2 as the NEWLINE option
    replace.xml   : The file created on using 1 as the REPLACE option
    unicode.xml   : The file created on using 1 as the UNICODE option

Pre-tagged Senseval-1 and Senseval-2 data files with split sentence
restoration are available at the authors' web page.

    Senseval-1:
        Training file : pretag-train-S1.xml
        Test file     : pretag-test-S1.xml
    Senseval-2:
        Training file : pretag-train-S2.xml
        Test file     : pretag-test-S2.xml

These files have been created using the following options respectively:

    proc 1 2 1 2 1 1 DESTINATION SOURCE
    proc 2 2 1 2 1 1 DESTINATION SOURCE
    proc 3 2 1 2 1 1 DESTINATION SOURCE
    proc 4 2 1 2 1 1 DESTINATION SOURCE

6. REFINE DETAILS:
=================

Following is a description of the various perl programs used to process
data in Senseval-2 data format.

6.1 PRE-TAGGING OF HEAD WORDS (pretag.pl):
-----------------------------------------

`pretag.pl' pre-tags the head words of Senseval-2 data in a format
acceptable by the Brill Tagger. Its usage is as follows:

Usage: pretag.pl [OPTIONS] TAGS DESTINATION SOURCE [SOURCE]...

OPTIONS:
    --help            Prints the help message.
    --PNouns PNOUNS   Used to tag specific instances of the head word.
                      Instance information is stored in file PNOUNS.

The program pre-tags the head words in the SOURCE file(s) with part of
speech tags based on morphological form and the broad part of speech
class already known from the data. TAGS refers to the file holding this
information. This package provides files for the Senseval-1 and
Senseval-2 evaluation and training data, and for the "line", "hard",
"serve" and "interest" data.

The TAGS file is analogous to the LEXICON file of the Brill Tagger.
Every line in the file is treated as a separate entry. Each entry
consists of a particular type followed by its most likely part of
speech tag.
Data in Senseval-2 data format has an xml tag which provides the
lexical task name and the part of speech of the following instances.
For example:

    <lexelt item="art.n">

Here `art' is the name of the lexical task, while the following
instances of the task have the head words in the noun form, indicated
by `n'. Given that the instances are nouns, only those morphological
form entries in TAGS are considered which have a noun part of speech
associated with them. Once a match in the surface form is found, the
associated part of speech is chosen.

Every token within the head tags is considered for pre-tagging. If
there exists an entry for the head word in the TAGS file, then the head
word is pre-tagged with the associated most likely tag.

It may be noted that head words with an apostrophe are first tokenized
in the following manner by pretag.pl before pre-tagging:

    band's
        tokenized to...    band 's
        pre-tagged to...   band//NN 's

In the TAGS files provided, no entry for `'s' exists, hence it is not
pre-tagged. It is left to the tagger to appropriately part of speech
tag the apostrophe token.

The names of the pre-tagging files provided with this package are as
follows:

    Senseval-1 training data   : train-types-S1.txt
    Senseval-1 evaluation data : test-types-S1.txt
    Senseval-2 training data   : train-types-S2.txt
    Senseval-2 evaluation data : test-types-S2.txt
    Other data                 : types-O.txt

    Location : $REFINEHOME/data/PRETAG

Entries in `types-O.txt' correspond to the "line", "hard", "serve" and
"interest" data. This information may be supplemented or replaced to
deal with other data.

All head word instances with the same morphological form are tagged
alike by the default option. Specific instances may have their pre-tag
superseded by another user specified pre-tag, using the --PNouns
option. The pre-tag to be assigned to the instance and the line number
of the head word in the data file are to be specified in the PNOUNS
file.
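
The morphology based pre-tagging described in this section (without the
--PNouns override) can be sketched as follows. This is only an
illustration in Python, not the code of pretag.pl: the TAGS entries,
the mapping from the broad class letters (n, v, a) to tag prefixes, and
the function names are all assumptions made for the example.

```python
# Hypothetical TAGS entries (surface form, part of speech tag). The real
# TAGS files ship with the package; these rows are invented for the example.
TAGS = [("band", "NN"), ("band", "VB"), ("bands", "NNS")]

# Assumed mapping from the broad class of the lexical task (n, v, a)
# to the prefix shared by the matching Penn Treebank tags.
CLASS_PREFIX = {"n": "NN", "v": "VB", "a": "JJ"}


def tokenize_head(head):
    """Split a possessive 's off each head token: band's -> band 's."""
    tokens = []
    for tok in head.split():
        if tok.endswith("'s") and len(tok) > 2:
            tokens.extend([tok[:-2], "'s"])
        else:
            tokens.append(tok)
    return tokens


def lookup(token, broad_class, tags=TAGS):
    """Return the first tag for token that agrees with the broad class."""
    prefix = CLASS_PREFIX[broad_class]
    for form, tag in tags:
        if form == token and tag.startswith(prefix):
            return tag
    return None


def pretag_head(head, broad_class, tags=TAGS):
    """Pre-tag every head token that has a matching TAGS entry."""
    out = []
    for tok in tokenize_head(head):
        tag = lookup(tok, broad_class, tags)
        # Tokens without an entry (such as 's) are left for the tagger.
        out.append(f"{tok}//{tag}" if tag else tok)
    return " ".join(out)


print(pretag_head("band's", "n"))
```

Restricting the lookup to entries that agree with the broad class is
what lets the same surface form receive a noun tag in a noun task and a
verb tag in a verb task.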
The package provides PNOUNS files with encoded information for all
capitalized noun head words in the Senseval-1 and Senseval-2 data. The
context of these instances has been manually examined to determine the
most suitable part of speech of the head word. The first token in every
line is the instance being pre-tagged. This information is not used for
pre-tagging. It is followed by its line number in the Senseval-2 data
format file and the pre-tag. If no pre-tag is specified, a common noun
(NN) is chosen by default. Question marks before the line number
signify that an alternate pre-tag may be appropriate. Again, the
question marks do not affect processing.

The names of the files provided with this package are as follows:

    Senseval-1 training data   : train-nouns-S1.txt
    Senseval-1 evaluation data : test-nouns-S1.txt
    Senseval-2 training data   : train-nouns-S2.txt
    Senseval-2 evaluation data : test-nouns-S2.txt
    Other data                 : nouns-O.txt
    In case of no specific
    instance pre-tagging       : dummy.txt (blank file)

    Location : $REFINEHOME/data/PRETAG

The `nouns-O.txt' file is presently blank. It may be updated to do
specific instance pre-tagging in data other than the Senseval-1 and
Senseval-2 English Lexical Sample data.

The script runs the program with the following options:

    pretag.pl -PNouns $REFINEHOME/data/PRETAG/$nouns \
              $REFINEHOME/data/PRETAG/$types pretag.xml 1.txt

6.2 CONCATENATING SPLIT SENTENCES (join.pl):
-------------------------------------------

A pre-requisite in using the Brill tagger is that parts of sentences
should not have new line characters between them. The Senseval-1 and
Senseval-2 data do not adhere to this requirement completely and other
sources of data could have the same problem. `join.pl' may be used to
concatenate the split sentences.

Usage: join.pl [OPTIONS] LINES SOURCE DESTINATION

OPTIONS:
    --help   Prints the help message.

The program concatenates lines of the SOURCE file as specified by their
line numbers listed in the LINES file.
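
The idea of joining lines by line number can be sketched as follows.
This is an illustration in Python, not the code of join.pl. The layout
of the LINES file is not described in this README, so the sketch
assumes the simplest possible convention: each listed number is the
1-based number of a SOURCE line that must be joined with the line that
follows it. The function name and the sample sentences are
hypothetical.

```python
def join_lines(source_lines, join_after):
    """Join every line whose 1-based number is in join_after with the
    line that follows it, and return the resulting list of lines."""
    out = []
    carry = None  # text of a split sentence still waiting for its tail
    for number, line in enumerate(source_lines, start=1):
        text = line.rstrip("\n")
        if carry is not None:
            # Glue the pending fragment to the current line.
            text = carry + " " + text.lstrip()
            carry = None
        if number in join_after:
            carry = text
        else:
            out.append(text)
    if carry is not None:
        # The last line was marked but nothing follows it.
        out.append(carry)
    return out


source = ["He went", "to the bank.", "It was closed."]
print(join_lines(source, {1}))
```

Because the numbers always refer to the original SOURCE file, several
consecutive fragments can be listed without adjusting for earlier
joins.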
`LINES' files containing this information for the Senseval-1 and
Senseval-2 data, constructed by manual inspection, are provided with
this package. The output of the program is the DESTINATION file.

The names of the files provided with this package are as follows:

    Senseval-1 training data   : train-lines-S1.txt
    Senseval-1 evaluation data : test-lines-S1.txt
    Senseval-2 training data   : train-lines-S2.txt
    Senseval-2 evaluation data : test-lines-S2.txt
    Other data                 : lines-O.txt

    Location : $REFINEHOME/data/SPLIT

The `lines-O.txt' file is presently blank. It may be filled in to do
split sentence restoration in data other than the Senseval-1 and
Senseval-2 English Lexical Sample data.

The script runs the program with the following options:

    join.pl $REFINEHOME/data/SPLIT/$lines pretag.xml join.xml

6.3 PLACING SENTENCES ON NEWLINES (mark.pl and multisent.pl):
------------------------------------------------------------

As mentioned above, we would like the tagger to receive a text file
with all sentences on new lines. Certain data files might have sentence
boundaries marked by explicit sentence boundary markers, and may have
multiple sentences on the same line. mark.pl and multisent.pl are perl
programs which handle these two cases respectively.

mark.pl eliminates sentence boundary markers (if present) in the SOURCE
file. It then places each sentence on a new line. The output is placed
in the DESTINATION file.

Usage: mark.pl [OPTIONS] DESTINATION SOURCE

OPTIONS:
    --help   Prints the help message.

The script runs the program with the following options:

    mark.pl marked.xml join.xml

multisent.pl detects multiple sentences in each line of the SOURCE file
and places each sentence on a new line. The output is placed in
DESTINATION. The program assumes non-tokenized input.

Usage: multisent.pl [OPTIONS] DESTINATION SOURCE

OPTIONS:
    --help   Prints the help message.
The script runs the program with the following options:

    multisent.pl multisent.xml marked.xml

6.4 REPLACING CONTEXTS OF SPECIFIC INSTANCES (replace.pl):
---------------------------------------------------------

This program may be used to replace the contexts of certain instances
in a Senseval-2 data format SOURCE file with modified/new contexts. The
instances whose contexts are to be replaced and the modified/new
contexts are specified via a CONTEXT file.

The CONTEXT file must have the instance ID of the instance whose
context is to be replaced, followed by the modified/new context. The
context must be demarcated by <context> and </context> tags. The
instance ID, <context> and </context> tags must each be on a new line.
Blank lines are allowed; however, everything between the context tags
is considered part of the context.

The token(s) between the <head> and </head> tags of the context in the
CONTEXT file are replaced by the corresponding tokens in the SOURCE
file. If the original context has its head word pre-tagged with a part
of speech, the updated context will thus have the pre-tag. The updated
SOURCE is placed in the DESTINATION file.

USAGE : replace.pl OPTIONS CONTEXT DESTINATION SOURCE

OPTIONS:
    --version   Prints the version number.
    --help      Prints the help message.

Some contexts in the Senseval-1, Senseval-2 and "serve" data have very
long sentences (more than 120 tokens). Such sentences may be split into
two (possibly more) sentences by manual inspection. Using replace.pl,
contexts having such long sentences may be replaced with contexts
having manually split sentences. CONTEXT files corresponding to the
Senseval-1 and Senseval-2 test and training data and the "serve" data
are provided with this package.
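
The CONTEXT file layout described in this section (instance ID, then a
context enclosed by <context> and </context>, each on its own line) can
be read with a few lines of code. The following Python sketch is an
illustration only and is not taken from replace.pl; the function name
and the instance IDs in the example are hypothetical.

```python
def parse_context_file(lines):
    """Map each instance ID to its replacement context.

    Blank lines outside a context are skipped; everything between the
    <context> and </context> lines, blank lines included, is kept.
    """
    replacements = {}
    instance_id = None
    body = None  # None while outside a <context> block
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if body is None:
            if stripped == "<context>":
                if instance_id is None:
                    raise ValueError("<context> without an instance ID")
                body = []
            elif stripped:
                instance_id = stripped
        elif stripped == "</context>":
            replacements[instance_id] = "\n".join(body)
            instance_id, body = None, None
        else:
            body.append(line)
    return replacements


example = [
    "art.40001",
    "<context>",
    "The <head>art</head> of war .",
    "</context>",
    "",
    "art.40002",
    "<context>",
    "Modern <head>art</head> .",
    "</context>",
]
print(parse_context_file(example))
```

Keying the result on the instance ID is the reason the IDs in a CONTEXT
file need to be unique.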
The CONTEXT files provided with this package are named:

    Senseval-1 training data   : train-rep-S1.txt
    Senseval-1 evaluation data : test-rep-S1.txt
    Senseval-2 training data   : train-rep-S2.txt
    Senseval-2 evaluation data : test-rep-S2.txt
    Other data (serve data)    : rep-O.txt

    Location : $REFINEHOME/data/REPLACE

The `rep-O.txt' file contains information for the "serve", "hard",
"line" and "interest" data. It may be supplemented with information for
any other data as long as the instance IDs are unique.

The script runs the program with the following options:

    replace.pl CONTEXT replace.xml multisent.xml

6.5 REPLACING UNICODE CHARACTERS WITH XML TAGS (unicode2xml.pl):
----------------------------------------------------------------

unicode2xml.pl replaces special unicode characters, as listed in the
file UNIMAP, by appropriate xml tags which are also listed in UNIMAP.
UNIMAP is to have one entry per unicode character. Each entry is on one
line and has the unicode character, white space and the corresponding
xml representation, in that order.

USAGE : unicode2xml.pl [OPTIONS] UNIMAP DESTINATION SOURCE

This option is provided specially to handle certain occurrences of
unicode characters in Senseval-2 data which should have been
represented by xml tags, as is the format of Senseval-2 data. A default
UNIMAP file which caters to Senseval-2 data is provided with this
package (unicodemap.txt). If the data being used is different and has
unicode characters, an appropriate UNIMAP file must be created and
copied to $REFINEHOME/unicodemap.txt.

The script runs the program with the following options:

    unicode2xml.pl $REFINEHOME/unicodemap.txt unicode.xml replace.xml

7. COPYING:
==========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.

8. REFERENCES:
=============

1. [Brill 92] E. Brill. A Simple Rule-Based Part of Speech Tagger. In
   Proceedings of the Third Conference on Applied Natural Language
   Processing, Trento, Italy, 1992.

2. [Brill 94] E. Brill. Some Advances in Rule-Based Part of Speech
   Tagging. In Proceedings of the 12th National Conference on
   Artificial Intelligence (AAAI-94), Seattle, WA, 1994.

3. [Mohammad Pedersen 2002] S. Mohammad and T. Pedersen. Guaranteed
   Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics (CICLing-2003), February 2003, Mexico
   City.