README
------

parseSenseval Version 0.1

Copyright (C) 2001-2003

Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu

University of Minnesota, Duluth

##################### LAST UPDATED: Dec, 2003 ###########################

parseSenseval is a software package to syntactically parse any text that
is part of speech tagged in the format produced by the Brill Part of
Speech Tagger [Brill, 1992][Brill, 1994]. The Collins Parser is utilized
for this purpose. Perl programs are provided to pre-process the text to
make it suitable for parsing, and to post-process the output of the
parser so that the various syntactic tags are xmlized (placed in angular
braces <>).

A script, `parse', is provided to take you through the complete process.
The script takes as input the Senseval-2 data format file corresponding
to the part of speech tagged text and puts back all the xml tags into
the parsed and xmlized text. It creates a text file in Senseval-2 data
format which is enriched with xml tags corresponding to the parse
information of the sentences.

##################### TO IMMEDIATELY PARSE DATA #########################
#
# 1. Download and unpack:
#    a. parseSenseval:
#       http://www.d.umn.edu/~moha0149/research.html
#       Set the environment variable and path as shown:
#          setenv PARSEHOME $PARSEPACKAGE/parseSenseval
#          set path = ($PARSEHOME $path)
#       ...where $PARSEPACKAGE represents the directory where the
#       package is unpacked.
#
#    b. SenseTools:
#       http://www.d.umn.edu/~tpederse/senseval2.html
#       Set the PATH as shown:
#          set path = ($SENSEHOME/SenseTools-0.1 $path)
#       ...where $SENSEHOME represents the directory where SenseTools
#       is unpacked, given that the version being used is 0.1.
#
#    c. Collins Parser:
#       http://www.ai.mit.edu/people/mcollins
#       Compile the parser.
#
# 2. Type:
#       parse COLLIN MODEL OUTPUT SENSEVAL2 POS [train]
#
#    COLLIN is the Collins Parser's home directory. This directory has
#    the README for the parser and subdirectories such as `code' and
#    `models'. The directory path is used to access the appropriate
#    parser executable.
#
#    MODEL specifies which model of the Collins Parser is to be used.
#    It can take one of three values: 1, 2 or 3, corresponding to the
#    respective models.
#
#    OUTPUT is the name of the directory where all the output files
#    are to be created. The directory will be created by the script
#    and must not already exist.
#
#    SENSEVAL2 is the original Senseval-2 data format file.
#
#    POS is the name of the input file which has the part of speech
#    tagged sentences. The sentences are expected to be in the format
#    in which the Brill Tagger produces part of speech tagged
#    sentences. A complete or relative path to the file may be
#    specified.
#
#    The optional sixth argument, if specified as `train', causes the
#    final parsed xml and count files to be named as training files.
#    If it is not specified, the files are named as test files.
#    Note: this argument does not in any way influence the parsing or
#    the contents of the files.
#
#    A concrete example invocation is given just below.
#
###########################################################################
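As a concrete example of step 2 above, a typical invocation (the
directory and file names here are hypothetical, not files shipped with
the package) might be:

     parse $HOME/COLLINS-PARSER 2 myoutput english-lex-sample.xml english-lex-sample.pos train

This parses the sentences in english-lex-sample.pos with model 2 of the
Collins Parser and creates the enriched Senseval-2 data format file and
the per-lexical-element files under the new directory `myoutput',
naming them as training files.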
################################ DETAILS ##################################

1. INTRODUCTION:
===============

Identifying the various syntactic relations amongst the words and
phrases within a sentence is known as parsing. A phrase is a sequence of
words which together has some meaning but is not enough to get across an
idea or thought, for example, {\em the deep ocean}. The word within the
phrase which is central in determining the relation of the phrase with
other phrases of the sentence is known as the head word, or simply the
head, of the phrase. The part of speech of the head word determines the
syntactic identity of the phrase. The aforementioned phrase has {\em
ocean} as the head, which is a noun; the phrase is thus termed a noun
phrase.

A parser is used to automatically parse a sentence. Consider the
sentence:

     Harry Potter cast a bewitching spell

Some of the aspects a parser might identify are that the sentence is
composed of a noun phrase {\em Harry Potter} and a verb phrase {\em cast
a bewitching spell}. The verb phrase is in turn made of a verb {\em
cast} and a noun phrase {\em a bewitching spell}. We shall call the verb
phrase the parent (phrase) of the verb {\em cast} and of the noun phrase
{\em a bewitching spell}. Conversely, the verb and the noun phrase shall
be referred to as children (child phrases) of the verb phrase. The
parsed output will also contain the head words of the various phrases
and a hierarchical relation amongst the phrases, depicted by a parse
tree.

Numerous parsers available commercially and in the public domain, such
as the Charniak Parser [Charniak, 2000], MINIPAR [Lin, 1998], the Cass
Parser [Abney, 1996][Abney, 1997] and the Collins Parser [Collins, 1999]
[Collins, 2000], were considered. The Collins Parser [Collins, 1999]
[Collins, 2000] is used by this package due to its many desired
attributes, listed below:

a. The source code of the parser is available. This enables us to better
   understand the parser and use it to its best potential.

b. It takes part of speech tagged text as input. There are many parsers
   that take raw sentences as input and both part of speech tag them and
   provide a parsed structure. Using such a parser, however, would mean
   that we would be unable to utilize the Brill Tagger [Brill, 1992]
   [Brill, 1994] and guaranteed pre-tagging [Mohammad and Pedersen,
   2002] of the head words, which we believe provide high quality part
   of speech tagging.

The Senseval-2 exercise, held in 2001, brought together numerous word
sense disambiguation systems from all over the world on one platform.
The results from all these systems could be compared, as the systems
were trained and evaluated on a common sense tagged data set. An
offshoot of this was that all the systems involved were designed to
accept a common data format - the Senseval-2 data format. It may be
noted that data in Senseval-2 format has numerous xml tags, identified
by angular braces (<>), which are not a part of the contextual
information.

This package post-processes the output of the parser to xmlize (place in
angular braces <>) the various syntactic tags. The master script `parse'
takes as input the Senseval-2 data format file corresponding to the part
of speech tagged text and puts back all the xml tags into the parsed and
xmlized text. The script creates a text file in Senseval-2 data format
which is enriched with xml tags corresponding to the parse information
of the sentences. This text may be utilized for numerous natural
language tasks, such as word sense disambiguation, that can make use of
the parse information.

The choice of Senseval-2 as the data format is justified by the
significant amount of recent and expected future work in this format.
The Senseval-3 exercise will continue to use the Senseval-2 data format.
Data which has been used extensively in the past, such as the "line",
"hard", "serve" and "interest" data, is also available in Senseval-2
data format at:

     http://www.d.umn.edu/~tpederse/data.html

The lineOneTwo, hardOneTwo, serveOneTwo and interestOneTwo packages,
which convert the "line", "hard", "serve" and "interest" data,
respectively, into Senseval-1 data format and then use Sval1to2 to
convert them to Senseval-2 data format, are available at the authors'
web page. Thus, there is a rich source of sense-tagged data available in
the Senseval-2 data format which may be used for many Natural Language
Processing tasks. parseSenseval can be used to enrich these texts with
parse information while still maintaining conformance with the
Senseval-2 data format.

2. Senseval-2 data format:
=========================

The English lexical sample task of Senseval-2 has 4328 instances of
evaluation data and 8611 instances of training data. Each instance has a
sense tagged sentence along with a few neighboring sentences forming its
context. The data is partitioned into two sets - the training data and
the evaluation data. In the evaluation data, one word in the context is
marked (surrounded by <head> and </head>) as the target word, whose
sense as intended in the context is to be disambiguated. The data also
has other xml tags (demarcated by angular braces) and certain sgml tags
(placed in square braces). The training data has the same kind of tags
but additionally has the intended sense of the target word.

Information about the Senseval-2 data format is available at its web
page:

     http://www.sle.sharp.co.uk/senseval2

3. LOCATION OF FILES:
====================

On unpacking the package, all files and data created are within the
`parseSenseval' directory. This directory is created in the directory
where the package was unpacked. Let PARSEHOME represent the complete
path to parseSenseval. The scripts and all the perl programs, as
indicated by their `.pl' extension, are located here.

The raw part of speech tagged Senseval-1, Senseval-2, "line", "hard",
"serve" and "interest" data, as created by the Brill Tagger, may be
obtained from the authors' web page.

All the perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, requisite changes will need to be made to the
first line of the perl programs. An alternative to changing the code for
this purpose is to alias `/usr/bin/perl' to the appropriate perl
location. For example, if the perl executable is located at
`/usr/local/bin/perl', the following command aliases `/usr/bin/perl' to
`/usr/local/bin/perl':

     alias /usr/bin/perl /usr/local/bin/perl

4. NECESSARY SOFTWARE:
=====================

The following software must be downloaded and their locations placed in
the PATH in order to successfully parse data.

1. parseSenseval:
   http://www.d.umn.edu/~moha0149/research.html

   The complete path of the parseSenseval directory, created after
   unpacking the package, needs to be set as an environment variable in
   your .cshrc file. This is how I set it:

     setenv PARSEHOME $PARSEPACKAGE/parseSenseval

   ...where $PARSEPACKAGE represents the directory where the package has
   been unpacked. The $PARSEPACKAGE/parseSenseval directory must be
   added to the PATH as well. This is how I did it:

     set path = ($PARSEHOME $path)

2. SenseTools (version 0.1 or above):
   http://www.d.umn.edu/~tpederse/senseval2.html

   The directory which contains `preprocess.pl' (a part of SenseTools)
   should be placed in the PATH.
   This is how I made the entry in my .cshrc file, assuming SenseTools
   has been downloaded in $SENSEHOME (say) and the version being used
   is 0.1:

     set path = ($SENSEHOME/SenseTools-0.1 $path)

3. Collins Parser:
   http://www.ai.mit.edu/people/mcollins

   Compile the parser as directed in its README.

5. MASTER SCRIPT:
================

A script, `parse', is provided with this package which takes you through
the complete process of pre-processing the text to make it suitable for
parsing by the Collins Parser, parsing the text, and post-processing the
output to xmlize the syntactic tags and convert the result to Senseval-2
data format. The script takes as input the Senseval-2 data format file
corresponding to the part of speech tagged text and puts back all the
xml tags into the parsed and xmlized text. The script creates a text
file in Senseval-2 data format which is enriched with xml tags
corresponding to the parse information of the sentences.

Following is the usage of `parse':

     parse COLLIN MODEL OUTPUT SENSEVAL2 POS [train]

COLLIN is the Collins Parser's home directory. This directory has the
README for the parser and subdirectories such as `code' and `models'.
The directory path is used to access the appropriate parser executable.

MODEL specifies which model of the Collins Parser is to be used. It can
take one of three values: 1, 2 or 3, corresponding to the respective
models.

OUTPUT is the name of the directory where all the output files are to be
created. The directory will be created by the script and must not
already exist.

SENSEVAL2 is the original Senseval-2 data format file.

POS is the name of the input file which has the part of speech tagged
sentences. The sentences are expected to be in the format in which the
Brill Tagger produces part of speech tagged sentences. A complete or
relative path to the file may be specified.

The optional sixth argument, if specified as `train', causes the final
parsed xml and count files to be named as training files. If it is not
specified, the files are named as test files. Note: this argument does
not in any way influence the parsing or the contents of the files.

The script `parse' first copies the SENSEVAL2 and POS files into the
OUTPUT directory under the names `sval2.xml' and `1.txt'. It then
creates a number of intermediate files before finally creating
`parse.xml', which is the Senseval-2 data format file enriched with xml
tags that carry the parse information of the sentences. `parse.xml' is
created in the $OUTPUT/SPLIT directory. It then goes on to create the
individual lexical element files. The individual files generated for
each lexical element can be identified by their `.count' and `.xml'
extensions. These files are created in the $OUTPUT/SPLIT/LexSamples/
directory.

The data generated has all contextual alphabetic characters in lower
case. This is done so that strings with different capitalizations may be
treated as the same type by the systems which utilize this data.

6. DETAILS OF INDIVIDUAL PERL PROGRAMS:
=====================================

Following is a description of the various perl programs utilized by
`parse' to do the pre-processing, parsing and post-processing. The
programs are described in their order of appearance in the script.

6.1 PRE-PROCESSING:
==================

This is the stage in which the part of speech tagged text in the Brill
Tagger format is converted to a format acceptable to the Collins Parser.
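To make this conversion concrete, the following is a minimal Perl sketch
of the kind of reformatting involved. It is illustrative only and is not
the package's ColinPrep.pl (described in 6.1.1 below), which performs
additional checks:

     #!/usr/bin/perl -w
     # Illustrative sketch only: reformat Brill-tagged sentences such as
     #     The/DT company/NN grew/VBD
     # into the form expected by the Collins Parser: a leading token
     # count followed by space-separated word and tag pairs.
     use strict;

     while (my $line = <STDIN>) {
         chomp $line;
         my @out;
         foreach my $pair (split ' ', $line) {
             my ($word, $tag) = $pair =~ m{^(.*)/([^/]+)$}
                 or next;                # skip anything that is not word/tag
             $tag =~ s/\|.*$//;          # drop additional tags after `|'
             $tag = 'SYM' if ($tag eq '(' || $tag eq ')');  # tags unknown to the parser
             push @out, $word, $tag;
         }
         # the token count is placed at the start of the sentence
         print scalar(@out) / 2, " ", join(" ", @out), "\n";
     }

Running this on the line `The/DT company/NN grew/VBD' would produce
`3 The DT company NN grew VBD'.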
6.1.1 RE-FORMATTING (ColinPrep.pl):
----------------------------------

ColinPrep.pl reformats data in the SOURCE file to a format suitable for
parsing by the Collins Parser. It takes as input part of speech tagged
sentences such that:

a. A token and its part of speech are separated by a forward slash `/'.
b. All sentences are on separate lines.

The output is placed in the DESTINATION file. The FILTER collects lines
with more than NUM words.

USAGE : ColinPrep.pl [OPTIONS] NUM DESTINATION SOURCE > FILTER

OPTIONS:
--version    Prints the version number.
--help       Prints the help message.

Following is a list of its functions:

a. Counts the number of tokens in each sentence and places the count at
   the start of the sentence.
b. Sentences which have more than NUM tokens are printed to standard
   output and may be captured in a file by redirection. A count of the
   total number of such sentences is also printed out.
c. Separates each word and its part of speech by a space, eliminating
   the separating slash.
d. Eliminates additional tags for a word, if specified (identified by
   `|').
e. Converts Brill tags such as `(' and `)', which are not understood by
   the Collins Parser, into the most similar tags, such as `SYM'.

The script runs the program with the following options:

     ColinPrep.pl 119 ready2parse 1.txt > filter

NUM is chosen to be 119 as the Collins Parser does not accept sentences
with 120 or more tokens. `1.txt' is a copy of the part of speech tagged
file.

6.1.2 SPLITTING (chunk.pl):
--------------------------

chunk.pl splits a SOURCE file into multiple files. Each piece has NUM
lines, except the last file, which may have fewer. The output files are
named DESTINATION.1, DESTINATION.2 and so on.

USAGE : chunk.pl [OPTIONS] NUM DESTINATION SOURCE

--version    Prints the version number.
--help       Prints the help message.

The script runs the program with the following options:

     chunk.pl 2400 parse ../ready2parse

NUM is chosen to be 2400 as the Collins Parser does not accept files
with 2500 or more sentences. The command is executed in the
$OUTPUT/SPLIT directory; thus all the individual split files are created
here.

6.2 PARSING (COLLINS PARSER):
============================

The Collins Parser is used to syntactically parse the sentences. The
parser is available at:

     http://www.ai.mit.edu/people/mcollins

Compile the parser as directed in its README. We have used it with the
suggested default options:

     gunzip -c $COLLIN/models/model1/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model1/grammar 10000 1 1 1 1 > ./$OUTPUT

     gunzip -c $COLLIN/models/model2/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model2/grammar 10000 1 1 1 1 > ./$OUTPUT

     gunzip -c $COLLIN/models/model3/events.gz | $COLLIN/code/parser \
       ./$INPUT $COLLIN/models/model3/grammar 10000 1 1 1 1 > ./$OUTPUT

`parse' executes one of the three commands listed above depending on the
MODEL specified as a command line argument. The details of the three
models may be found in Michael Collins' thesis [Collins, 1999]. A brief
description is given below:

MODEL 1: A lexicalized probabilistic context free grammar approach.

MODEL 2: Adds the ability to distinguish between complements (subjects
         and objects) and adjuncts. Additionally, it uses probability
         distributions over the subcategorization frames of the head
         words.

MODEL 3: Builds on MODEL 2 by probabilistically identifying where
         wh-movement has occurred. The output is modified to include
         trace symbols.

INPUT is the part of speech tagged text, in a format acceptable to the
Collins Parser, with fewer than 2500 sentences.

OUTPUT is the parsed output of the INPUT file.

Each of the individual split files, described in the previous section,
is parsed this way (a sketch of such a loop is given below).
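The following is a hypothetical sketch of such a loop, running the
MODEL 2 command over each split file. It is for illustration only; the
actual `parse' script may organize this differently:

     #!/usr/bin/perl -w
     # Hypothetical sketch: run the Collins Parser (model 2) over each
     # of the split files created by chunk.pl (parse.1, parse.2, ...).
     # COLLIN and OUTPUT are assumed to be set as environment variables;
     # the actual `parse' script may differ.
     use strict;

     my $collin = $ENV{COLLIN};
     my $output = $ENV{OUTPUT};

     foreach my $in (sort grep { /parse\.\d+$/ } glob "$output/SPLIT/parse.*") {
         my $out = "$in.out";
         system("gunzip -c $collin/models/model2/events.gz | " .
                "$collin/code/parser $in $collin/models/model2/grammar " .
                "10000 1 1 1 1 > $out") == 0
             or warn "parser failed on $in\n";
     }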
6.3 POST PROCESSING:
===================

This stage involves placing all the xml and sgml tags back into the
parsed data and formatting the parse information and part of speech
tags as xml tags.

6.3.1 XML'izing The Parse Tags (parse2xml.pl):
---------------------------------------------

This program takes as input the output of the Collins Parser. The tree
format output is ignored. The bracketed form of the output is xmlized
and placed in DESTINATION. The parse information and part of speech tags
are placed in angular braces. An open brace `(' signifies the start of a
phrase, and the token (TOK) that follows it is placed within an angular
braced `P' tag; the `P' tells us that this xml tag has parse
information.

A sample of the conversion is provided below:

     (NPB~company~2~2 The/DT company/NN )

is converted to an xmlized form in which the phrase label
NPB~company~2~2 and the part of speech tags DT and NN are each placed in
angular braces around the tokens `The company'.
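The sketch below illustrates the kind of conversion described above. It
is illustrative only and is not the parse2xml.pl program itself; the
exact tag syntax written by parse2xml.pl may differ from the <P=...> and
<POS> forms assumed here.

     #!/usr/bin/perl -w
     # Illustrative sketch only: xmlize Collins-style bracketed output.
     # The exact tags produced by parse2xml.pl may differ from the
     # <P=...> and <POS> forms assumed below.
     use strict;

     while (my $line = <STDIN>) {
         $line =~ s/\b[tT]\/TRACE\s*//g;    # drop trace tokens added by the parser (see below)
         $line =~ s/\((\S+)/<P=$1>/g;       # `(' plus phrase label  ->  opening tag
         $line =~ s/(\S+)\/(\S+)/<$2>$1/g;  # word\/POS              ->  <POS>word
         $line =~ s/\s+\)/ <\/P>/g;         # closing `)'            ->  closing tag
         print $line;
     }

Under these assumptions, the sample phrase above would come out as:

     <P=NPB~company~2~2> <DT>The <NN>company </P>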

The Collins Parser adds the token-part of speech pairs `t/TRACE' and
`T/TRACE'. These tokens were not part of the original sentence and are
hence eliminated.

SOURCE has the output of the parser. The xmlized output is placed in
DESTINATION. parse2xml.pl has the following usage:

USAGE : parse2xml.pl [OPTIONS] DESTINATION SOURCE

--version    Prints the version number.
--help       Prints the help message.

The script runs the program with the following options:

     parse2xml.pl parse.xml1 parse.txt

6.3.2 RETURN OF THE SENSEVAL TAGS (xml2sval.pl):
-----------------------------------------------

This program takes as input two files, SOURCE1 and SOURCE2, which have
the same tokens in the same order. However, SOURCE1 and SOURCE2 might
have different xml tags. `xml2sval.pl' places a copy of SOURCE2 in
DESTINATION and adds the xml tags from SOURCE1 at the corresponding
positions relative to the tokens.

Each (non-xml) token in SOURCE1 is compared with the corresponding token
in SOURCE2. Any mismatch causes the relevant lines of the two files to
be printed to the file FILTER. This file is not created if there is
perfect synchronization between the two files. The contextual alphabetic
characters are converted to lower case.

Other functions include:

a. Non-contextual data within square brackets is placed in angular
   braces.
b. Non-contextual data within curly braces (found in Senseval-1 data) is
   also placed in angular braces. An exception to this is typographical
   error information in curly braces; this information is replaced by
   the correct spelling.
c. Character references which were eliminated during pre-processing are
   placed in angular braces.

It has the following usage:

USAGE : xml2sval.pl [OPTIONS] DESTINATION SOURCE1 SOURCE2

The script runs the program with the following options:

     xml2sval.pl parse.xml ../sval2.xml parse.xml1

6.3.3 FILES FOR EACH LEXICAL ELEMENT:
------------------------------------

Individual xml and count files for each lexical element may be generated
by processing the parsed Senseval-2 data format file generated above,
`parse.xml', with preprocess.pl. Following is what I have used:

     preprocess.pl output.xml --nontoken $PARSESENSEVAL/non-token.txt --token $PARSESENSEVAL/token2.txt --putSentenceTags

The xml and count files are created in the directory corresponding to
the lexical element in the OUTPUT directory: $OUTPUT/SPLIT/LexSample

6.3.4 Eliminating <s> and </s> Tags (newline.pl):
------------------------------------------------

`newline.pl' eliminates the sentence delimiter tags in the SOURCE
file(s) and places the sentences on new lines. The sentence boundaries
should be demarcated by <s> ... </s> in the SOURCE file. The output is
stored in the DESTINATION file.

Usage : newline.pl [OPTIONS] DESTINATION SOURCE

Each of the individual lexical element parsed files created by the
preprocess.pl program is processed with newline.pl.

7. MISCELLANEOUS SCRIPTS:
========================

     numparse.pl count parse.xml

Apart from the debug and testing files mentioned above, the following
scripts can be used to check the validity and sanctity of the parsed
data produced.

head.pl : This script can be used to list all head instances in the
initial source file and the final parsed files. The data should match
both in number and in what they hold. This is how I have used it:

     head.pl SOURCE1 > DESTINATION1
     head.pl SOURCE2 > DESTINATION2

where SOURCE1 is the file before any pre-processing by this package,
SOURCE2 is the corresponding xmlized output file, and DESTINATION1 and
DESTINATION2 are the output files which list the head instances.
sat.pl : This script can be used to list all `sat' instances in the
initial source file and the final parsed files. The data should match
both in number and in what they hold. This is how I have used it:

     sat.pl SOURCE1 > DESTINATION1
     sat.pl SOURCE2 > DESTINATION2

where SOURCE1 is the file before any pre-processing by this package,
SOURCE2 is the corresponding xmlized output file, and DESTINATION1 and
DESTINATION2 are the output files which list the sat instances.

The script runs the programs with the following options:

     head.pl 1.txt > head-in.out
     head.pl 7.txt > head-out.out
     sat.pl 1.txt > sat-in.out
     sat.pl 7.txt > sat-out.out

10. Copying:
===========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

11. REFERENCES:
==============

1. [Collins, 1999] M. Collins. Head-Driven Statistical Models for
   Natural Language Parsing. PhD thesis, University of Pennsylvania,
   1999.

2. [Brill, 1992] E. Brill. A Simple Rule-Based Part of Speech Tagger.
   In Proceedings of the Third Conference on Applied Natural Language
   Processing, Trento, Italy, 1992.

3. [Brill, 1994] E. Brill. Some Advances in Rule-Based Part of Speech
   Tagging. In Proceedings of the 12th National Conference on Artificial
   Intelligence (AAAI-94), Seattle, WA, 1994.

4. [Brill, 1995] E. Brill. Transformation-based error-driven learning
   and natural language processing: A case study in part of speech
   tagging. Computational Linguistics, 21(4):543-565, 1995.

5. [Mohammad and Pedersen, 2002] S. Mohammad and T. Pedersen. Guaranteed
   Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics (CICLing-2003), Mexico City, February 2003.

6. [Ramshaw and Marcus, 1994] L. Ramshaw and M. Marcus. Exploring the
   statistical derivation of transformational rule sequences for
   part-of-speech tagging. In the ACL Balancing Act Workshop, pages
   86-95, 1994.