README
------

parseSenseval Version 0.1

Copyright (C) 2001-2003

Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu

University of Minnesota, Duluth

##################### LAST UPDATED: Dec, 2003 ###########################

parseSenseval is a software package to syntactically parse any text that
is part of speech tagged in the format produced by the Brill Part of
Speech Tagger [Brill, 1992][Brill, 1994]. The Collins Parser is utilized
for this purpose. Perl programs are provided to pre-process the text to
make it suitable for parsing, and to post-process the output of the
parser so that the various syntactic tags are xmlized (placed in angular
braces <>).

A script, `parse', is provided to take you through the complete process.
The script takes as input the Senseval-2 data format file corresponding
to the part of speech tagged text and puts back all the xml tags into
the parsed and xmlized text. It creates a text file in Senseval-2 data
format which is enriched with xml tags corresponding to the parse
information of the sentences.

##################### TO IMMEDIATELY PARSE DATA #########################
#
# 1. Download and unpack:
#    a. parseSenseval:
#       http://www.d.umn.edu/~moha0149/research.html
#       Set the environment variable and path as shown:
#          setenv PARSEHOME $PARSEPACKAGE/parseSenseval
#          set path = ($PARSEHOME $path)
#       ...where $PARSEPACKAGE represents the directory where the
#       package is unpacked.
#
#    b. SenseTools:
#       http://www.d.umn.edu/~tpederse/senseval2.html
#       Set the PATH as shown:
#          set path = ($SENSEHOME/SenseTools-0.1 $path)
#       ...where $SENSEHOME represents the directory where SenseTools
#       is unpacked, given that the version being used is 0.1.
#
#    c. Collins Parser:
#       http://www.ai.mit.edu/people/mcollins
#       Compile the parser.
#
# 2. Type:
#       parse COLLIN MODEL OUTPUT SENSEVAL2 POS [train]
#
#    COLLIN is the Collins Parser's home directory. This directory has
#    the README for the parser and subdirectories such as `code' and
#    `models'. The directory path is used to access the appropriate
#    parser executable.
#
#    MODEL specifies which model of the Collins Parser is to be used.
#    It can take one of three values: 1, 2 or 3, corresponding to the
#    respective models.
#
#    OUTPUT is the name of the directory where all the output files
#    are to be created. The directory will be created by the script
#    and must not already exist.
#
#    SENSEVAL2 is the original Senseval-2 data format file.
#
#    POS is the name of the input file which has the part of speech
#    tagged sentences. The sentences are expected to be in the format
#    in which the Brill Tagger produces part of speech tagged
#    sentences. A complete or relative path to the file may be
#    specified.
#
#    The optional sixth argument, if specified as `train', causes the
#    final parsed xml and count files to be named as training files.
#    If it is not specified, the files are named as test files.
#    Note: this argument does not in any way influence the parsing or
#    the contents of the files.
#
#    A concrete example invocation is given just below.
#
###########################################################################
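As a concrete example of step 2 above, a typical invocation (the
directory and file names here are hypothetical, not files shipped with
the package) might be:

     parse $HOME/COLLINS-PARSER 2 myoutput english-lex-sample.xml english-lex-sample.pos train

This parses the sentences in english-lex-sample.pos with model 2 of the
Collins Parser and creates the enriched Senseval-2 data format file and
the per-lexical-element files under the new directory `myoutput',
naming them as training files.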
################################ DETAILS ##################################

1. INTRODUCTION:
===============

Identifying the various syntactic relations amongst the words and
phrases within a sentence is known as parsing. A phrase is a sequence of
words which together has some meaning but is not enough to get across an
idea or thought, for example, {\em the deep ocean}. The word within the
phrase which is central in determining the relation of the phrase with
other phrases of the sentence is known as the head word, or simply the
head, of the phrase. The part of speech of the head word determines the
syntactic identity of the phrase. The aforementioned phrase has {\em
ocean} as the head, which is a noun; the phrase is thus termed a noun
phrase.

A parser is used to automatically parse a sentence. Consider the
sentence:

     Harry Potter cast a bewitching spell

Some of the aspects a parser might identify are that the sentence is
composed of a noun phrase {\em Harry Potter} and a verb phrase {\em cast
a bewitching spell}. The verb phrase is in turn made of a verb {\em
cast} and a noun phrase {\em a bewitching spell}. We shall call the verb
phrase the parent (phrase) of the verb {\em cast} and of the noun phrase
{\em a bewitching spell}. Conversely, the verb and the noun phrase shall
be referred to as children (child phrases) of the verb phrase. The
parsed output will also contain the head words of the various phrases
and a hierarchical relation amongst the phrases, depicted by a parse
tree.

Numerous parsers available commercially and in the public domain, such
as the Charniak Parser [Charniak, 2000], MINIPAR [Lin, 1998], the Cass
Parser [Abney, 1996][Abney, 1997] and the Collins Parser [Collins, 1999]
[Collins, 2000], were considered. The Collins Parser [Collins, 1999]
[Collins, 2000] is used by this package due to its many desired
attributes, listed below:

a. The source code of the parser is available. This enables us to better
   understand the parser and use it to its best potential.

b. It takes part of speech tagged text as input. There are many parsers
   that take raw sentences as input and both part of speech tag them and
   provide a parsed structure. Using such a parser, however, would mean
   that we would be unable to utilize the Brill Tagger [Brill, 1992]
   [Brill, 1994] and guaranteed pre-tagging [Mohammad and Pedersen,
   2002] of the head words, which we believe provide high quality part
   of speech tagging.

The Senseval-2 exercise, held in 2001, brought together numerous word
sense disambiguation systems from all over the world on one platform.
The results from all these systems could be compared, as the systems
were trained and evaluated on a common sense tagged data set. An
offshoot of this was that all the systems involved were designed to
accept a common data format - the Senseval-2 data format. It may be
noted that data in Senseval-2 format has numerous xml tags, identified
by angular braces (<>), which are not a part of the contextual
information.

This package post-processes the output of the parser to xmlize (place in
angular braces <>) the various syntactic tags. The master script `parse'
takes as input the Senseval-2 data format file corresponding to the part
of speech tagged text and puts back all the xml tags into the parsed and
xmlized text. The script creates a text file in Senseval-2 data format
which is enriched with xml tags corresponding to the parse information
of the sentences. This text may be utilized for numerous natural
language tasks, such as word sense disambiguation, that can make use of
the parse information.

The choice of Senseval-2 as the data format is justified by the
significant amount of recent and expected future work in this format.
The Senseval-3 exercise will continue to use the Senseval-2 data format.
Data which has been used extensively in the past, such as the "line",
"hard", "serve" and "interest" data, is also available in Senseval-2
data format at:

     http://www.d.umn.edu/~tpederse/data.html

The lineOneTwo, hardOneTwo, serveOneTwo and interestOneTwo packages,
which convert the "line", "hard", "serve" and "interest" data,
respectively, into Senseval-1 data format and then use Sval1to2 to
convert them to Senseval-2 data format, are available at the authors'
web page. Thus, there is a rich source of sense-tagged data available in
the Senseval-2 data format which may be used for many Natural Language
Processing tasks. parseSenseval can be used to enrich these texts with
parse information while still maintaining conformance with the
Senseval-2 data format.

2. Senseval-2 data format:
=========================

The English lexical sample task of Senseval-2 has 4328 instances of
evaluation data and 8611 instances of training data. Each instance has a
sense tagged sentence along with a few neighboring sentences forming its
context. The data is partitioned into two sets - the training data and
the evaluation data. In the evaluation data, one word in the context is
marked (surrounded by <head> and </head>) as the target word, whose
sense as intended in the context is to be disambiguated. The data also
has other xml tags (demarcated by angular braces) and certain sgml tags
(placed in square braces). The training data has the same kind of tags
but additionally has the intended sense of the target word.

Information about the Senseval-2 data format is available at its web
page:

     http://www.sle.sharp.co.uk/senseval2

3. LOCATION OF FILES:
====================

On unpacking the package, all files and data created are within the
`parseSenseval' directory. This directory is created in the directory
where the package was unpacked. Let PARSEHOME represent the complete
path to parseSenseval. The scripts and all the perl programs, as
indicated by their `.pl' extension, are located here.

The raw part of speech tagged Senseval-1, Senseval-2, "line", "hard",
"serve" and "interest" data, as created by the Brill Tagger, may be
obtained from the authors' web page.

All the perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, requisite changes will need to be made to the
first line of the perl programs. An alternative to changing the code for
this purpose is to alias `/usr/bin/perl' to the appropriate perl
location. For example, if the perl executable is located at
`/usr/local/bin/perl', the following command aliases `/usr/bin/perl' to
`/usr/local/bin/perl':

     alias /usr/bin/perl /usr/local/bin/perl

4. NECESSARY SOFTWARE:
=====================

The following software must be downloaded and their locations placed in
the PATH in order to successfully parse data.

1. parseSenseval:
   http://www.d.umn.edu/~moha0149/research.html

   The complete path of the parseSenseval directory, created after
   unpacking the package, needs to be set as an environment variable in
   your .cshrc file. This is how I set it:

     setenv PARSEHOME $PARSEPACKAGE/parseSenseval

   ...where $PARSEPACKAGE represents the directory where the package has
   been unpacked. The $PARSEPACKAGE/parseSenseval directory must be
   added to the PATH as well. This is how I did it:

     set path = ($PARSEHOME $path)

2. SenseTools (version 0.1 or above):
   http://www.d.umn.edu/~tpederse/senseval2.html

   The directory which contains `preprocess.pl' (a part of SenseTools)
   should be placed in the PATH.
   This is how I made the entry in my .cshrc file, assuming SenseTools
   has been downloaded in $SENSEHOME (say) and the version being used
   is 0.1:

     set path = ($SENSEHOME/SenseTools-0.1 $path)

3. Collins Parser:
   http://www.ai.mit.edu/people/mcollins

   Compile the parser as directed in its README.

5. MASTER SCRIPT:
================

A script, `parse', is provided with this package which takes you through
the complete process of pre-processing the text to make it suitable for
parsing by the Collins Parser, parsing the text, and post-processing the
output to xmlize the syntactic tags and convert the result to Senseval-2
data format. The script takes as input the Senseval-2 data format file
corresponding to the part of speech tagged text and puts back all the
xml tags into the parsed and xmlized text. The script creates a text
file in Senseval-2 data format which is enriched with xml tags
corresponding to the parse information of the sentences.

Following is the usage of `parse':

     parse COLLIN MODEL OUTPUT SENSEVAL2 POS [train]

COLLIN is the Collins Parser's home directory. This directory has the
README for the parser and subdirectories such as `code' and `models'.
The directory path is used to access the appropriate parser executable.

MODEL specifies which model of the Collins Parser is to be used. It can
take one of three values: 1, 2 or 3, corresponding to the respective
models.

OUTPUT is the name of the directory where all the output files are to be
created. The directory will be created by the script and must not
already exist.

SENSEVAL2 is the original Senseval-2 data format file.

POS is the name of the input file which has the part of speech tagged
sentences. The sentences are expected to be in the format in which the
Brill Tagger produces part of speech tagged sentences. A complete or
relative path to the file may be specified.

The optional sixth argument, if specified as `train', causes the final
parsed xml and count files to be named as training files. If it is not
specified, the files are named as test files. Note: this argument does
not in any way influence the parsing or the contents of the files.

The script `parse' first copies the SENSEVAL2 and POS files into the
OUTPUT directory under the names `sval2.xml' and `1.txt'. It then
creates a number of intermediate files before finally creating
`parse.xml', which is the Senseval-2 data format file enriched with xml
tags that carry the parse information of the sentences. `parse.xml' is
created in the $OUTPUT/SPLIT directory. It then goes on to create the
individual lexical element files. The individual files generated for
each lexical element can be identified by their `.count' and `.xml'
extensions. These files are created in the $OUTPUT/SPLIT/LexSamples/
directory.

The data generated has all contextual alphabetic characters in lower
case. This is done so that strings with different capitalizations may be
treated as the same type by the systems which utilize this data.

6. DETAILS OF INDIVIDUAL PERL PROGRAMS:
=====================================

Following is a description of the various perl programs utilized by
`parse' to do the pre-processing, parsing and post-processing. The
programs are described in their order of appearance in the script.

6.1 PRE-PROCESSING:
==================

This is the stage in which the part of speech tagged text in the Brill
Tagger format is converted to a format acceptable to the Collins Parser.
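To make this conversion concrete, the following is a minimal Perl sketch
of the kind of reformatting involved. It is illustrative only and is not
the package's ColinPrep.pl (described in 6.1.1 below), which performs
additional checks:

     #!/usr/bin/perl -w
     # Illustrative sketch only: reformat Brill-tagged sentences such as
     #     The/DT company/NN grew/VBD
     # into the form expected by the Collins Parser: a leading token
     # count followed by space-separated word and tag pairs.
     use strict;

     while (my $line = <STDIN>) {
         chomp $line;
         my @out;
         foreach my $pair (split ' ', $line) {
             my ($word, $tag) = $pair =~ m{^(.*)/([^/]+)$}
                 or next;                # skip anything that is not word/tag
             $tag =~ s/\|.*$//;          # drop additional tags after `|'
             $tag = 'SYM' if ($tag eq '(' || $tag eq ')');  # tags unknown to the parser
             push @out, $word, $tag;
         }
         # the token count is placed at the start of the sentence
         print scalar(@out) / 2, " ", join(" ", @out), "\n";
     }

Running this on the line `The/DT company/NN grew/VBD' would produce
`3 The DT company NN grew VBD'.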
6.1.1 RE-FORMATTING (ColinPrep.pl):
----------------------------------

ColinPrep.pl reformats data in the SOURCE file to a format suitable for
parsing by the Collins Parser. It takes as input part of speech tagged
sentences such that:

a. A token and its part of speech are separated by a forward slash `/'.
b. All sentences are on separate lines.

The output is placed in the DESTINATION file. The FILTER collects lines
with more than NUM words.

USAGE : ColinPrep.pl [OPTIONS] NUM DESTINATION SOURCE > FILTER

OPTIONS:
--version    Prints the version number.
--help       Prints the help message.

Following is a list of its functions:

a. Counts the number of tokens in each sentence and places the count at
   the start of the sentence.
b. Sentences which have more than NUM tokens are printed to standard
   output and may be captured in a file by redirection. A count of the
   total number of such sentences is also printed out.
c. Separates each word and its part of speech by a space, eliminating
   the separating slash.
d. Eliminates additional tags for a word, if specified (identified by
   `|').
e. Converts Brill tags such as `(' and `)', which are not understood by
   the Collins Parser, into the most similar tags, such as `SYM'.

The script runs the program with the following options:

     ColinPrep.pl 119 ready2parse 1.txt > filter

NUM is chosen to be 119 as the Collins Parser does not accept sentences
with 120 or more tokens. `1.txt' is a copy of the part of speech tagged
file.

6.1.2 SPLITTING (chunk.pl):
--------------------------

chunk.pl splits a SOURCE file into multiple files. Each piece has NUM
lines, except the last file, which may have fewer. The output files are
named DESTINATION.1, DESTINATION.2 and so on.

USAGE : chunk.pl [OPTIONS] NUM DESTINATION SOURCE

--version    Prints the version number.
--help       Prints the help message.

The script runs the program with the following options:

     chunk.pl 2400 parse ../ready2parse

NUM is chosen to be 2400 as the Collins Parser does not accept files
with 2500 or more sentences. The command is executed in the
$OUTPUT/SPLIT directory; thus all the individual split files are created
here.

6.2 PARSING (COLLINS PARSER):
============================

The Collins Parser is used to syntactically parse the sentences. The
parser is available at:

     http://www.ai.mit.edu/people/mcollins

Compile the parser as directed in its README. We have used it with the
suggested default options:

     gunzip -c $COLLIN/models/model1/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model1/grammar 10000 1 1 1 1 > ./$OUTPUT

     gunzip -c $COLLIN/models/model2/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model2/grammar 10000 1 1 1 1 > ./$OUTPUT

     gunzip -c $COLLIN/models/model3/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model3/grammar 10000 1 1 1 1 > ./$OUTPUT

`parse' executes one of the three commands listed above depending on the
MODEL specified as a command line argument. The details of the three
models may be found in Michael Collins' thesis [Collins, 1999]. A brief
description is given below:

MODEL 1: A lexicalized probabilistic context free grammar approach.

MODEL 2: Adds the ability to distinguish between complements (subjects
         and objects) and adjuncts. Additionally, it uses probability
         distributions over the subcategorization frames of the head
         words.

MODEL 3: Builds on MODEL 2 by probabilistically identifying where
         wh-movement has occurred. The output is modified to include
         trace symbols.

INPUT is the part of speech tagged text, in a format acceptable to the
Collins Parser, with fewer than 2500 sentences.

OUTPUT is the parsed output of the INPUT file.

Each of the individual split files, described in the previous section,
is parsed this way (a sketch of such a loop is given below).
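The following is a hypothetical sketch of such a loop, running the
MODEL 2 command over each split file. It is for illustration only; the
actual `parse' script may organize this differently:

     #!/usr/bin/perl -w
     # Hypothetical sketch: run the Collins Parser (model 2) over each
     # of the split files created by chunk.pl (parse.1, parse.2, ...).
     # COLLIN and OUTPUT are assumed to be set as environment variables;
     # the actual `parse' script may differ.
     use strict;

     my $collin = $ENV{COLLIN};
     my $output = $ENV{OUTPUT};

     foreach my $in (sort grep { /parse\.\d+$/ } glob "$output/SPLIT/parse.*") {
         my $out = "$in.out";
         system("gunzip -c $collin/models/model2/events.gz | " .
                "$collin/code/parser $in $collin/models/model2/grammar " .
                "10000 1 1 1 1 > $out") == 0
             or warn "parser failed on $in\n";
     }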
6.3 POST PROCESSING:
===================

This stage involves placing all the xml and sgml tags back into the
parsed data and formatting the parse information and part of speech
tags as xml tags.

6.3.1 XML'izing The Parse Tags (parse2xml.pl):
---------------------------------------------

This program takes as input the output of the Collins Parser. The tree
format output is ignored. The bracketed form of the output is xmlized
and placed in DESTINATION. The parse information and part of speech tags
are placed in angular braces. An open brace `(' signifies the start of a
phrase, and the token (TOK) that follows it is placed within an angular
braced `P' tag; the `P' tells us that this xml tag has parse
information.

A sample of the conversion is provided below:

     (NPB~company~2~2 The/DT company/NN )

is converted to an xmlized form in which the phrase label
NPB~company~2~2 and the part of speech tags DT and NN are each placed in
angular braces around the tokens `The company'.
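The sketch below illustrates the kind of conversion described above. It
is illustrative only and is not the parse2xml.pl program itself; the
exact tag syntax written by parse2xml.pl may differ from the <P=...> and
<POS> forms assumed here.

     #!/usr/bin/perl -w
     # Illustrative sketch only: xmlize Collins-style bracketed output.
     # The exact tags produced by parse2xml.pl may differ from the
     # <P=...> and <POS> forms assumed below.
     use strict;

     while (my $line = <STDIN>) {
         $line =~ s/\b[tT]\/TRACE\s*//g;    # drop trace tokens added by the parser (see below)
         $line =~ s/\((\S+)/<P=$1>/g;       # `(' plus phrase label  ->  opening tag
         $line =~ s/(\S+)\/(\S+)/<$2>$1/g;  # word\/POS              ->  <POS>word
         $line =~ s/\s+\)/ <\/P>/g;         # closing `)'            ->  closing tag
         print $line;
     }

Under these assumptions, the sample phrase above would come out as:

     <P=NPB~company~2~2> <DT>The <NN>company </P>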

The Collins Parser adds the token-part of speech pairs `t/TRACE' and
`T/TRACE'. These tokens were not part of the original sentence and are
hence eliminated.

SOURCE has the output of the parser. The xmlized output is placed in
DESTINATION. parse2xml.pl has the following usage:

USAGE : parse2xml.pl [OPTIONS] DESTINATION SOURCE

--version    Prints the version number.
--help       Prints the help message.

The script runs the program with the following options:

     parse2xml.pl parse.xml1 parse.txt

6.3.2 RETURN OF THE SENSEVAL TAGS (xml2sval.pl):
-----------------------------------------------

This program takes as input two files, SOURCE1 and SOURCE2, which have
the same tokens in the same order. However, SOURCE1 and SOURCE2 might
have different xml tags. `xml2sval.pl' places a copy of SOURCE2 in
DESTINATION and adds the xml tags from SOURCE1 at the corresponding
positions relative to the tokens.

Each (non-xml) token in SOURCE1 is compared with the corresponding token
in SOURCE2. Any mismatch causes the relevant lines of the two files to
be printed to the file FILTER. This file is not created if there is
perfect synchronization between the two files. The contextual alphabetic
characters are converted to lower case.

Other functions include:

a. Non-contextual data within square brackets is placed in angular
   braces.
b. Non-contextual data within curly braces (found in Senseval-1 data) is
   also placed in angular braces. An exception to this is typographical
   error information in curly braces; this information is replaced by
   the correct spelling.
c. Character references which were eliminated during pre-processing are
   placed in angular braces.

It has the following usage:

USAGE : xml2sval.pl [OPTIONS] DESTINATION SOURCE1 SOURCE2

The script runs the program with the following options:

     xml2sval.pl parse.xml ../sval2.xml parse.xml1

6.3.3 FILES FOR EACH LEXICAL ELEMENT:
------------------------------------

Individual xml and count files for each lexical element may be generated
by processing the parsed Senseval-2 data format file generated above,
`parse.xml', with preprocess.pl. Following is what I have used:

     preprocess.pl output.xml --nontoken $PARSESENSEVAL/non-token.txt --token $PARSESENSEVAL/token2.txt --putSentenceTags

The xml and count files are created in the directory corresponding to
the lexical element in the OUTPUT directory: $OUTPUT/SPLIT/LexSample

6.3.4 Eliminating <s> and </s> Tags (newline.pl):
------------------------------------------------

`newline.pl' eliminates the sentence delimiter tags in the SOURCE
file(s) and places the sentences on new lines. The sentence boundaries
should be demarcated by <s> ... </s> in the SOURCE file. The output is
stored in the DESTINATION file.

Usage : newline.pl [OPTIONS] DESTINATION SOURCE

Each of the individual lexical element parsed files created by the
preprocess.pl program is processed with newline.pl.

7. MISCELLANEOUS SCRIPTS:
========================

     numparse.pl count parse.xml

Apart from the debug and testing files mentioned above, the following
scripts can be used to check the validity and sanctity of the parsed
data produced.

head.pl : This script can be used to list all head instances in the
initial source file and the final parsed files. The data should match
both in number and in what they hold. This is how I have used it:

     head.pl SOURCE1 > DESTINATION1
     head.pl SOURCE2 > DESTINATION2

where SOURCE1 is the file before any pre-processing by this package,
SOURCE2 is the corresponding xmlized output file, and DESTINATION1 and
DESTINATION2 are the output files which list the head instances.
sat.pl : This script can be used to list all `sat' instances in the
initial source file and the final parsed files. The data should match
both in number and in what they hold. This is how I have used it:

     sat.pl SOURCE1 > DESTINATION1
     sat.pl SOURCE2 > DESTINATION2

where SOURCE1 is the file before any pre-processing by this package,
SOURCE2 is the corresponding xmlized output file, and DESTINATION1 and
DESTINATION2 are the output files which list the sat instances.

The script runs the programs with the following options:

     head.pl 1.txt > head-in.out
     head.pl 7.txt > head-out.out
     sat.pl 1.txt > sat-in.out
     sat.pl 7.txt > sat-out.out

10. Copying:
===========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

11. REFERENCES:
==============

1. [Collins, 1999] M. Collins. Head-Driven Statistical Models for
   Natural Language Parsing. PhD thesis, University of Pennsylvania,
   1999.

2. [Brill, 1992] E. Brill. A Simple Rule-Based Part of Speech Tagger.
   In Proceedings of the Third Conference on Applied Natural Language
   Processing, Trento, Italy, 1992.

3. [Brill, 1994] E. Brill. Some Advances in Rule-Based Part of Speech
   Tagging. In Proceedings of the 12th National Conference on Artificial
   Intelligence (AAAI-94), Seattle, WA, 1994.

4. [Brill, 1995] E. Brill. Transformation-based error-driven learning
   and natural language processing: A case study in part of speech
   tagging. Computational Linguistics, 21(4):543-565, 1995.

5. [Mohammad and Pedersen, 2002] S. Mohammad and T. Pedersen. Guaranteed
   Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics (CICLing-2003), Mexico City, February 2003.

6. [Ramshaw and Marcus, 1994] L. Ramshaw and M. Marcus. Exploring the
   statistical derivation of transformational rule sequences for
   part-of-speech tagging. In the ACL Balancing Act Workshop, pages
   86-95, 1994.