README
------

posSenseval Version 0.1

Copyright (C) 2001-2002

Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu

University of Minnesota, Duluth

##################### LAST UPDATED: May, 2003 ###########################

posSenseval is a software package that part of speech tags any text in
Senseval-2 data format. The Brill Tagger [Brill 92][Brill 94] is used to
do the tagging.

############### TO IMMEDIATELY PART OF SPEECH TAG DATA ####################
#
# 1. Download and unpack:
#    a. posSenseval:
#       http://www.d.umn.edu/~moha0149/research.html
#       Set the environment variable and path as shown:
#           setenv POSHOME $POSPACKAGE/posSenseval
#           set path = ($POSHOME $path)
#       ...where $POSPACKAGE represents the directory where the package
#       has been unpacked.
#
#    b. SenseTools:
#       http://www.d.umn.edu/~tpederse/senseval2.html
#       Set PATH as shown:
#           set path = ($SENSEHOME/SenseTools-0.1 $path)
#       ...where $SENSEHOME represents the directory where SenseTools
#       is unpacked, given that the version being used is 0.1.
#
#    c. Brill Tagger:
#       http://www.cs.jhu.edu/~brill
#       Compile the tagger.
#
#    d. BrillPatch:
#       http://www.d.umn.edu/~moha0149/research.html
#       (Optional but suggested. Apply the patch as per its README.)
#       Applying this patch guarantees pre-tagging and applies
#       lexicalized contextual rules even when the data is pre-tagged.
#
# 2. Type:
#       pos BRILL OUTPUT SOURCE [tr]
#
#    BRILL is the directory in which the Brill Tagger's executable
#    (tagger) is located.
#
#    OUTPUT is the name of the directory where all the output files
#    are to be created. The directory is created by the script and
#    must not already exist.
#
#    SOURCE is the name of the Senseval-2 data format file which is
#    to be part of speech tagged. An absolute or relative path to
#    the file may be specified. If `tr' is specified as the
#    fourth command line argument, the part of speech tagged
#    individual lexical element files are named as training files.
#    Otherwise, they are named as test files.
#
###########################################################################

################################ DETAILS ##################################

1. INTRODUCTION:
===============

Assigning the appropriate part of speech to the words in a text is known
as part of speech tagging. Consider the following sentence.

    Jack will chair the meeting

It may be part of speech tagged as shown below, where NNP stands for
proper noun, MD for modal, VB for verb, DT for determiner and NN for noun.

    Jack/NNP will/MD chair/VB the/DT meeting/NN

Numerous part of speech taggers are available commercially and in the
public domain. Most of these tag all the tokens in the given input text.
They expect the text to be tokenized and the sentences to be separated by
a marker (usually a new line character). This, however, is not always the
case, and that is what motivates this package.

Part of speech tagging is a pre-requisite for many natural language
processing tasks. Most of the techniques which utilize data for word sense
disambiguation rely on the part of speech of the target word, of its
surrounding words and sometimes of syntactically and lexically related
words. Many chunkers and parsers which are used to obtain these syntactic
relations rely on part of speech tags themselves.

The Senseval-2 exercise, held in 2001, brought together numerous word
sense disambiguation systems from all over the world on one platform. For
the first time, the results from all these systems could be compared, as
the systems were trained and evaluated on a common sense tagged data set.
One offshoot was that all the systems involved were designed to accept a
common data format - the Senseval-2 data format.

Data in Senseval-2 format has numerous xml and sgml tags which are not a
part of the contextual information. The text need not be tokenized and may
have parts of sentences separated by a new line character (split
sentences).
posSenseval is designed to efficiently part of speech tag any data in
Senseval-2 data format using the Brill Tagger, which, like any standard
tagger, has the requirements mentioned above.

Creating manually annotated data like the Senseval-2 English lexical
sample is both expensive and time intensive. Converting data from other
formats to the Senseval-2 data format can be a significant source of
quality sense tagged data for the numerous natural language tools which
accept data in Senseval-2 format.

The Sval1to2 package, which converts data in Senseval-1 format to
Senseval-2 data format, is available at:

    http://www.d.umn.edu/~tpederse/data.html

The lineOneTwo, hardOneTwo, serveOneTwo and interestOneTwo packages, which
convert the "line", "hard", "serve" and "interest" data respectively into
Senseval-1 data format and then use Sval1to2 to convert them to Senseval-2
data format, are available at the authors' web page.

Thus, there is a rich source of sense-tagged data available in the
Senseval-2 data format which may be used for many natural language
processing tasks. posSenseval can be used to assign parts of speech to
all of them.

2. Senseval-2 data format:
=========================

The English lexical sample space of Senseval-2 has 4328 instances of
evaluation data and 8611 instances of training data. Each instance has a
sense tagged sentence along with a few neighboring sentences forming its
context.

The data is partitioned into two sets - the training data and the
evaluation data. In the evaluation data, one word in the context is
marked (surrounded by
<head> and </head>) as the target word, whose sense as intended in the
context is to be disambiguated. It also has other xml tags (demarcated by
angle brackets) and certain sgml tags (placed in square brackets). The
training data has the same kind of tags but additionally has the intended
sense of the target word.

Information about the Senseval-2 data format is available at its web page.

    http://www.sle.sharp.co.uk/senseval2

3. SCRIPTS:
==========

A script, `pos', is provided with this package. It takes you through the
complete process of part of speech tagging any data in Senseval-2 data
format. Details are given in Sections 6 and 7.

Data may first be processed by `refine' to improve the quality of the
tagging done by `pos'. `refine' is a software package, available at the
authors' web page, which may be used to pre-tag the head words with
appropriate part of speech tags and to place sentences on new lines.

For higher tagging accuracy, the data should satisfy the following:

1. Symbols such as `<', `>', `[', `]', `{' and `}', which have special
   meaning in Senseval-1 or Senseval-2 data format, must be represented
   by appropriate character references in HTML4.0.
   http://www.cs.tut.fi/~jkorpela/html/guide/entities.html

2. There should not be new-line characters within angle brackets.

3. Sentences must be separated from each other by new line characters.

4. Sentences should not have new line characters within them.

5. All data within square brackets is assumed to be contextual if
   composed of multiple tokens and non-contextual otherwise (this is the
   case in Senseval-2 data, which has SGML tags).

The file `samplerun' shows how I have used pos to generate part of speech
tagged line, hard, serve, interest, Senseval-1 and Senseval-2 data.

4. LOCATION OF FILES:
====================

On unpacking the package, all files and data created are within the
`posSenseval' directory. This directory is created in the directory where
the package was unpacked.

Let POSHOME represent the complete path to posSenseval.
The scripts and all the perl programs, as indicated by their `.pl'
extension, are located here.

The original Senseval-1, Senseval-2 and the fixed Senseval-1 data may be
downloaded from:

Senseval-2      :
    http://www.sle.sharp.co.uk/senseval2/Results/guidelines.htm#rawdata
Senseval-1      :
    http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html
Fixed Senseval-1:
    http://www.d.umn.edu/~moha0149/research.html

More information about Senseval data and the Senseval exercise may be
obtained from the Senseval web page:

    http://www.itri.brighton.ac.uk/events/senseval/

Token files used by `preprocess.pl' are at      : $POSHOME/data/TOKEN/
Input Senseval-2 format files may be placed at  : $POSHOME/data/user/
Output created by the package may be placed at  : $POSHOME/data/output/

Senseval-1 and Senseval-2 test and training files which have been
processed by the `refine' package are at $POSHOME/data/samples/

These files have the head words pre-tagged and sentences placed on new
lines. They have been created by running the script `proc' of `refine'
with the following command line options:

    proc 1 2 1 2 1 DESTINATION SOURCE
    proc 2 2 1 2 1 DESTINATION SOURCE
    proc 3 2 1 2 1 DESTINATION SOURCE
    proc 4 2 1 2 1 DESTINATION SOURCE

For further details see the README of `refine'.

These files are described in the sections to follow. The location of the
above stated data files may be changed, but this entails appropriate
changes in the script.

All the perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, the first line of each perl program will need to be
changed accordingly. An alternative to changing the code is to alias
`/usr/bin/perl' to the appropriate perl location. For example, if the
perl software is located at `/usr/local/bin/perl', the following command
aliases `/usr/bin/perl' to `/usr/local/bin/perl'.

    alias /usr/bin/perl /usr/local/bin/perl

5. NECESSARY SOFTWARE:
=====================

The following software must be downloaded, and the locations noted below
placed in the PATH, in order to successfully part of speech tag data.

1. posSenseval:
       http://www.d.umn.edu/~moha0149/research.html

   The complete path of the posSenseval directory, created after
   unpacking the package, needs to be set as an environment variable in
   your .cshrc file. This is how I set it...

       setenv POSHOME $POSPACKAGE/posSenseval

   ...where $POSPACKAGE represents the directory where the package has
   been unpacked.

   The $POSPACKAGE/posSenseval directory must be added to the PATH as
   well. This is how I did it...

       set path = ($POSHOME $path)

2. SenseTools (version 0.1 or above):
       http://www.d.umn.edu/~tpederse/senseval2.html

   The directory which contains `preprocess.pl' (a part of SenseTools)
   should be placed in the PATH. This is how I made the entry in my
   .cshrc file, assuming SenseTools has been downloaded to $SENSEHOME
   (say) and the version being used is 0.1.

       set path = ($SENSEHOME/SenseTools-0.1 $path)

3. Brill Tagger:
       http://www.cs.jhu.edu/~brill

   `pos' copies the data files, after appropriate pre-processing, to the
   directory housing the Brill Tagger's executable. A necessary
   pre-condition of using the Brill Tagger is that the tagging has to be
   done in the directory containing the initial and final state tagger.

   Note: The Brill Tagger's home directory need not be added to the PATH
   explicitly; the script takes care of it.

   It is also suggested that the patch to the Brill Tagger, BrillPatch
   [Mohammad Pedersen 2002], be downloaded from...

       http://www.d.umn.edu/~moha0149/research.html

   Applying this patch guarantees pre-tagging and applies lexicalized
   contextual rules even when the data is pre-tagged. This ensures a
   higher precision of tagging when `pos' is used in combination with
   `refine', which pre-tags the data. Please see the paper
   [Mohammad Pedersen 2002] for further details.
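The setup requirements of this section can be checked before running
`pos'. The following is only a rough sketch in Python and is not part of
posSenseval (which is written in Perl and csh). The function name
check_setup and its parameters are hypothetical; the names POSHOME, `pos',
`preprocess.pl' and `tagger' are the ones used in this README.

```python
import os
import shutil


def check_setup(brill_dir, environ=None, which=shutil.which):
    """Return a list of setup problems; an empty list means none were found.

    check_setup is a hypothetical helper, not a posSenseval program.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # POSHOME should hold the complete path of the posSenseval directory.
    poshome = environ.get("POSHOME")
    if not poshome:
        problems.append("POSHOME is not set")
    elif not os.path.isdir(poshome):
        problems.append("POSHOME does not point to a directory: " + poshome)

    # `pos' ships with posSenseval; `preprocess.pl' ships with SenseTools.
    # Both directories are expected to be on the PATH.
    for program in ("pos", "preprocess.pl"):
        if which(program) is None:
            problems.append(program + " was not found on the PATH")

    # The Brill Tagger directory is given to `pos' as an argument and
    # need not be on the PATH, but it should contain the executable.
    if not os.path.isfile(os.path.join(brill_dir, "tagger")):
        problems.append("no tagger executable in " + brill_dir)

    return problems
```

Calling check_setup with the directory you intend to pass as BRILL returns
a list of messages. An empty list suggests that the three pieces of
software are where this section expects them to be; it does not check
that the tagger was compiled correctly or that BrillPatch was applied.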
6. GENERATING PART OF SPEECH TAGGED DATA:
========================================

Data in Senseval-2 data format may be part of speech tagged using the
following script. Type...

    pos BRILL OUTPUT SOURCE [tr]

where,

BRILL is the path to the directory which houses the Brill Tagger's
executable (tagger).

OUTPUT is the name of the directory where all the output files are to be
created. The directory is created by the script and must not already
exist. The final part of speech tagged output file is named output.xml

SOURCE is the name of the Senseval-2 data format file to be tagged. An
absolute or relative path to the file may be specified.

If `tr' is specified as the fourth command line argument, the individual
lexical element files are named as training files. In all other cases
they are named as test files.

If `refine' is used for pre-tagging and/or split sentence restoration,
its final output file (such as multisent.xml or replace.xml) should be
used as the SOURCE file.

Individual files generated for each lexical element can be identified by
their `.count' and `.xml' extensions. These files are created in the
$OUTPUT/LexSamples/ directory. A copy of the same is also placed in the
$POSHOME/data/WEKA directory. Thus, if you run this script with both
training and evaluation data, the training and evaluation .xml and .count
files are stored together for each lexical element in this directory.
For example, $POSHOME/data/WEKA/art.n will have four files - the .xml and
.count training and evaluation files.

The data generated has all contextual alphabetic characters in lower
case. This is done so that strings with different capitalizations may be
treated as the same type by the systems which utilize this data.

The script `pos' first copies the SOURCE file into the OUTPUT directory
under the name `1.txt'. It then creates a number of intermediate files
before finally creating `output.xml', which is the part of speech tagged
SOURCE file.
It then goes on to create the individual lexical element files as
mentioned above.

7. DETAILS OF GENERATING PART OF SPEECH TAGGED DATA:
===================================================

Following is a description of the various perl programs utilized by `pos'
to do the part of speech tagging. The tagging process may be broken into
three steps: Pre-Processing, the actual tagging and Post-Processing.

7.1 PRE-PROCESSING:
==================

This is the stage in which data is stripped of its xml and sgml tags and
appropriately processed to make it suitable to be tagged by the Brill
Tagger.

7.1.1 REMOVING XML TAGS WITHIN CONTEXT (filter1.pl):
---------------------------------------------------

filter1.pl removes the data within angle brackets in the context of the
SOURCE file. This data is non-contextual and is thus eliminated, along
with the angle brackets, before the tagging phase. The output is placed
in the DESTINATION file. The data which is filtered out is placed in the
FILTER file.

Usage: filter1.pl [OPTIONS] FILTER DESTINATION SOURCE

OPTIONS:
    --help    Prints the help message.

The script runs the program with the following options:

    filter1.pl filter1.out 2.txt 1.txt

7.1.2 ELIMINATING THE REMAINING NON-CONTEXTUAL DATA (filter2.pl):
----------------------------------------------------------------

`filter2.pl' is a perl program that eliminates the non-contextual data
within square brackets. It assumes that data within square brackets is
non-contextual; exceptions to this are therefore first converted to the
SGML format (`[' to `&lsqb' and `]' to `&rsqb'). The exceptions are
identified by the presence of multiple tokens within the square brackets.

The program also eliminates data within curly braces. This data is
assumed to be non-contextual, as found in Senseval-1 data. An exception
to this is typographical error information within curly braces. This
information is replaced by the correct spelling.
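
The square bracket rule described above may be easier to follow as code.
What follows is a minimal sketch in Python and is not the Perl code of
filter2.pl. It assumes that a token is any run of non-whitespace
characters and that the character references are set off by spaces; both
are assumptions, as is the function name filter_square. The curly brace
handling is not sketched, since the format of the typographical error
information is not described here.

```python
import re

# Matches one innermost [...] span and captures what is inside it.
SQUARE = re.compile(r"\[([^\[\]]*)\]")


def filter_square(text):
    """Apply the square bracket rule to text (hypothetical helper).

    Returns the filtered text and the list of spans that were removed.
    """
    removed = []

    def handle(match):
        inner = match.group(1)
        if len(inner.split()) > 1:
            # Multiple tokens: treated as contextual, so the data is kept
            # and the brackets are turned into character references.
            return "&lsqb " + inner + " &rsqb"
        # Zero or one token: treated as non-contextual and eliminated.
        removed.append(match.group(0))
        return ""

    return SQUARE.sub(handle, text), removed
```

For instance, given "he said [hi there] to [p] me", this sketch keeps
"hi there" between the two character references and drops "[p]", which it
returns separately in the same spirit as the FILTER file.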
As is the case with square brackets, if the curly braces are a part of
the context, they must be specified by their appropriate character
references - `&lcub;' and `&rcub;' represent left and right curly braces,
respectively.

The new line character in the second line of Senseval-1 test data, which
is within angle brackets, is eliminated by this program. It non-tokenizes
the `\/' symbol and the pre-tag slashes `//'. It also eliminates `\226'
and `\227' tokens, which are not part of the contextual information.

Following is the usage of filter2.pl:

Usage: filter2.pl [OPTIONS] FILTER DESTINATION SOURCE [SOURCE]...

The SOURCE file is processed to form the DESTINATION file. All the data
eliminated is stored in the FILTER file for reference.

OPTIONS:
    --help    Prints the help message.

The script runs the program with the following options:

    filter2.pl filter2.out 3.txt 2.txt

7.1.3 ELIMINATING XML TAGS (preprocess.pl):
------------------------------------------

`preprocess.pl' is used to eliminate xml tags before tagging and later to
generate xml and count files for individual lexical elements. It is a
part of SenseTools, which can be downloaded from...

    http://www.d.umn.edu/~tpederse/senseval2.html

I have used it with the following options to process the output file of
filter2.pl (3.txt).

    preprocess.pl 3.txt --token $POSHOME/data/TOKEN/token1.txt /
                  --count OUTPUT --noxml --putSentenceTags

The command produces a file (OUTPUT) which contains only the contextual
information - no tags other than