README
------
refine Version 0.2
Copyright (C) 2001-2002
Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth
##################### LAST UPDATED: Dec, 2003 ###########################
`refine' is a software package intended to process data in Senseval-2
data format, to make it better suited as input to various systems.
It may be used to pre-tag the head words and/or restore split sentences
and/or place sentences on new lines.
############################# QUICK RUN #################################
#
# 1. Download and unpack `refine':
#
# http://www.d.umn.edu/~moha0149/research.html
#
# Set environment variable and path as shown:
#
# setenv REFINEHOME $REFPACKAGE/refine
# set path = ($REFINEHOME $path)
#
# ...where $REFPACKAGE represents the directory where the package
# has been unpacked.
#
# 2. Type:
# proc DATA 2 1 2 1 1 OUTPUT SOURCE
#
# DATA specifies the type of data being processed. This
# information is used by the script to select appropriate
# pre-tagging and/or split sentence files. It can take any of
# the following numeric values:
#
# 1 Senseval-1 training data in Senseval-2 data format
# 2 Senseval-1 evaluation data in Senseval-2 data format
# 3 Senseval-2 training data
# 4 Senseval-2 evaluation data
# 5 Any other data in Senseval-2 data format
#
# PLEASE SEE THE DETAILED USAGE (SECTION 5, BELOW) TO ADAPT THE SCRIPT
# TO YOUR NEEDS.
#
###########################################################################
################################ DETAILS ##################################
1. INTRODUCTION:
===============
The Senseval-2 exercise, held in July 2001, brought together numerous
word sense disambiguation systems which were trained and tested on
one common data set and thereby one data format - the Senseval-2 data
format. Since then various other systems have been designed to accept
data in this format. The "line", "hard", "serve" and "interest" data
are also available in the Senseval-2 data format at the authors'
web pages. However, the data has certain irregularities: some
sentences are split up by new line characters, and some pairs of
sentences are not separated by new lines. Various software tools
and packages such as parsers and part of speech taggers require each
sentence to be on a line of its own. Additionally, they may improve
their accuracy if the head words in the data are pre-tagged with the
correct part of speech. This package aims at processing the data to
meet these requirements. Information about the Senseval-2 data format
is available at its web page.
http://www.sle.sharp.co.uk/senseval2
The original Senseval-1, Senseval-2 and the fixed Senseval-1 data may
be downloaded from:
Senseval-2 :
http://www.sle.sharp.co.uk/senseval2/Results/guidelines.htm#rawdata
Senseval-1 :
http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html
Fixed Senseval-1:
http://www.d.umn.edu/~moha0149/research.html
2. SCRIPTS:
==========
A script `proc' is provided with this package which takes you through
the complete process of appropriately processing any data in Senseval-2
data format.
proc: To pre-tag the head words and/or restore split sentences
and/or place sentences on new lines.
Files with the necessary pre-tagging information have been provided
for the Senseval-1, Senseval-2, line, hard, serve and interest data.
Any other data in Senseval-2 data format may be pre-tagged as well
after the creation of their corresponding pre-tagging files. Details
in Section 6.1.
Files with the necessary split sentence restoration have been provided
for the Senseval-1 and Senseval-2 data. Split sentences in any other
data which is in Senseval-2 data format may be restored after the
creation of their corresponding restoration files. Details in Section
6.2.
If the source data has sentence boundaries marked with <s> and </s>
markers, these markers may be eliminated and each sentence placed
on a new line using this script. proc may also be used to detect
multiple sentences within a pair of new line characters and place
each sentence on a new line. Details in Section 6.3.
If contexts of certain instances are to be replaced with user
specified contexts, proc may be used to do so. The user specifies
the instances and contexts based on a CONTEXT file. Some of the
contexts in Senseval-1, Senseval-2 and "serve" data have very
long sentences (more than 120 tokens). Such sentences may be split
into two (possibly more) sentences by manual inspection. The package
provides CONTEXT files which replace contexts having long sentences
by contexts which have manually split long sentences. Details in
Section 6.4.
3. LOCATION OF FILES:
====================
On unpacking the package all files and data created are within the
`refine' directory. This directory is created in the directory
where the package was unpacked. Let REFINEHOME represent the
complete path to `refine'. The scripts and all the perl programs,
as indicated by their `.pl' extension, are located here. Various
other files which have information regarding the following are
also provided.
pre-tagging information : $REFINEHOME/data/PRETAG/
split sentence information : $REFINEHOME/data/SPLIT/
context replacement information : $REFINEHOME/data/REPLACE/
May place input files at : $REFINEHOME/data/user/
Output may be placed at : $REFINEHOME/data/output/
These files are described in the sections to follow. The change in
location of the above stated data files is possible but entails
appropriate changes in the scripts. All the perl programs assume the
Perl software to be at `/usr/bin/perl'. If this is not the case, requisite
changes will need to be made to the first line of the perl programs.
An alternative to changing the code for this purpose is to alias
`/usr/bin/perl' to the appropriate perl location. For example, if the
perl software is located at `/usr/local/bin/perl', the
following command aliases `/usr/bin/perl' to `/usr/local/bin/perl'.
alias /usr/bin/perl /usr/local/bin/perl
4. NECESSARY SOFTWARE:
=====================
The following software must be downloaded and its location placed
in the PATH in order to process the data.
1. refine:
http://www.d.umn.edu/~moha0149/research.html
The complete path of the refine directory, created after
unpacking the package, needs to be set as an environment variable in
your .cshrc file. This is how I set it:
setenv REFINEHOME $REFPACKAGE/refine
...where $REFPACKAGE represents the directory where the package has
been unpacked. The $REFINEHOME directory must be added to
the PATH as well. This is how I did it:
set path = ($REFINEHOME $path)
5. REFINING THE DATA:
====================
The processing of data in Senseval-2 data format involves pre-tagging
the head words and concatenation of split sentences using the script
`proc'. If the part of speech tag usually associated with a
morphological form of a type is known, all instances of that type
which are head words can be pre-tagged with the associated part of
speech. By pre-tag we mean that the word is tagged with a part of
speech before the text is given to the tagger. Thus the tagger may use
this tag to influence its tagging of the surrounding words. Particular
instances of the head words may be pre-tagged to supersede the tag
based on morphology. The necessary files to pre-tag Senseval-1,
Senseval-2, "line", "hard", "serve" and "interest" data are provided
with this package. Pre-tagging files corresponding to other data in
Senseval-2 data format may be created and used as well. Details in
Section 6.1.
The part of speech of the head words in Senseval-1 and Senseval-2 data
is known. Although finer distinctions such as common noun, past
participle etc are not given, the broader class such as Noun, Verb etc
is specified. This information or a refinement of the same may be used
to pre-tag the head words. We are also aware that head words in the
"line" data are nouns (NN), "hard" data instances are adjectives (JJ),
and "interest" data instances are nouns (NN). This pre-tagging
information is provided for the above mentioned data in the form of
pre-tagging files.
Senseval-1 and Senseval-2 data have certain sentences split up by
newline characters. Such sentences have been manually identified and
the relevant information stored in text files. Split sentence files
corresponding to other data in Senseval-2 data format may be created
and used as well. Details in Section 6.2.
The `proc' script uses these pre-tagging and split sentence files
to pre-tag the data and/or restore split sentences. Following is its
usage:
USAGE : proc DATA TAG RESTORE NEWLINE REPLACE UNICODE OUTPUT SOURCE
DATA specifies the type of data being processed. This
information is used by the script to select appropriate
pre-tagging and/or split sentence files. It can take any of
the following numeric values:
1 Senseval-1 training data in Senseval-2 data format
2 Senseval-1 evaluation data in Senseval-2 data format
3 Senseval-2 training data
4 Senseval-2 evaluation data
5 Any other data in Senseval-2 data format
TAG is used to specify the level of pre-tagging to be done.
It can take any of the following numeric values:
1 Pre-Tagging based on head word Morphology
2 Pre-Tagging based on head word Morphology superseded
by specific instance Pre-Tagging
0 No Pre-tagging
Information for morphology based Pre-Tagging of all head words
of Senseval-1, Senseval-2, line, hard, serve and interest data
is provided with this package.
Please see Section 6.1 for information on how to customize.
Information for specific instance Pre-Tagging of all
capitalized noun head words of Senseval-1 and Senseval-2 data
is provided with this package.
Please see Section 6.1 for information on how to customize.
RESTORE specifies if any split sentences are to be restored.
It can take any of the following numeric values:
1 Restore split sentences
0 No split sentence restoration
Information for split sentence restoration of Senseval-1 and
Senseval-2 data is provided with this package.
Please see Section 6.2 for information on how to customize.
NEWLINE specifies if sentences are to be placed on new lines.
1 Eliminate sentence boundary markers <s> and </s>
(if present) and place each sentence on a new line.
2 In addition to 1 stated above, check text within
every pair of new line characters to see if
consisting of multiple sentences. If yes, place
each sentence on a new line.
0 No action
REPLACE specifies if contexts of certain instances are to be
replaced with modified/new contexts. The instances whose
contexts are to be replaced and the modified/new contexts are
specified via a CONTEXT file.
1 replace contexts
0 No action
UNICODE specifies if special unicode characters are to be
replaced by appropriate xml representations as listed in the file
unicodemap.txt.
1 replace unicode characters
0 No action
OUTPUT is the name of the directory where all the output files
are to be created. The directory specified will be created by
the script and must not already exist.
SOURCE is the name of the Senseval-2 data format file to be
processed. A complete or relative path to the file
may be specified.
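The argument rules above can be checked before the script is invoked.
The following Python sketch is illustrative only and is not part of this
package; the helper name build_proc_command and its error messages are
assumptions. It validates the six numeric options against the values
listed above, checks that OUTPUT does not exist yet, and returns the
command as an argument list.

```python
import os

# Allowed numeric values for each option, as listed in the usage above.
ALLOWED = {
    "DATA": {1, 2, 3, 4, 5},
    "TAG": {0, 1, 2},
    "RESTORE": {0, 1},
    "NEWLINE": {0, 1, 2},
    "REPLACE": {0, 1},
    "UNICODE": {0, 1},
}


def build_proc_command(data, tag, restore, newline, replace, unicode, output, source):
    """Hypothetical helper: validate the options and build the proc argument list."""
    values = dict(zip(ALLOWED, (data, tag, restore, newline, replace, unicode)))
    for name, value in values.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}, got {value}")
    # proc creates OUTPUT itself, so the directory must not be there yet.
    if os.path.exists(output):
        raise ValueError(f"OUTPUT directory {output!r} already exists")
    return ["proc", *(str(v) for v in values.values()), output, source]
```

The returned list could then be passed to subprocess.run, provided proc
is on the PATH as described in Section 4.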
The following files are placed in the OUTPUT directory:
1.txt : A copy of the SOURCE file.
pretag.xml : The pre-tagged SOURCE file.
Created only if pre-tagging is requested.
join.xml : Split sentence restored SOURCE file. Created only
if split sentence restoration requested.
marked.xml : The file created on using 1 as the NEWLINE option
multisent.xml : The file created on using 2 as the NEWLINE option
replace.xml : The file created on using 1 as the REPLACE option
unicode.xml : The file created on using 1 as the UNICODE option
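As a rough guide to which of the files above to expect for a given
set of options, consider the Python sketch below. It is not part of
this package, and the function name expected_outputs is hypothetical.
It also assumes that NEWLINE option 2 produces marked.xml as well as
multisent.xml, since option 2 is described as acting in addition to
option 1.

```python
def expected_outputs(tag, restore, newline, replace, unicode):
    """Hypothetical helper: list the files expected in the OUTPUT directory."""
    files = ["1.txt"]                      # copy of SOURCE, always present
    if tag in (1, 2):
        files.append("pretag.xml")         # created only if pre-tagging is requested
    if restore == 1:
        files.append("join.xml")           # created only if restoration is requested
    if newline in (1, 2):
        files.append("marked.xml")         # assumption: also created for option 2
    if newline == 2:
        files.append("multisent.xml")
    if replace == 1:
        files.append("replace.xml")
    if unicode == 1:
        files.append("unicode.xml")
    return files


print(expected_outputs(2, 1, 2, 1, 1))
```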
Pre-tagged Senseval-1 and Senseval-2 data files with split sentence
restoration are available at the authors' web page.
Senseval-1:
Training file : pretag-train-S1.xml
Test file : pretag-test-S1.xml
Senseval-2:
Training file : pretag-train-S2.xml
Test file : pretag-test-S2.xml
These files have been created using the following options respectively:
proc 1 2 1 2 1 1 DESTINATION SOURCE
proc 2 2 1 2 1 1 DESTINATION SOURCE
proc 3 2 1 2 1 1 DESTINATION SOURCE
proc 4 2 1 2 1 1 DESTINATION SOURCE
6. REFINE DETAILS:
=================
Following is a description of the various perl programs used to
process data in Senseval-2 data format.
6.1 PRE-TAGGING OF HEAD WORDS (pretag.pl):
-----------------------------------------
`pretag.pl' pre-tags the head words of Senseval-2 data in a format
acceptable to the Brill Tagger. Its usage is as follows:
Usage: pretag.pl [OPTIONS] TAGS DESTINATION SOURCE [SOURCE]...
OPTIONS:
--help prints the help message
--PNouns PNOUNS used to tag specific instances of the head
word. Instance information stored in file PNOUNS.
The program pre-tags the head words in the SOURCE file(s) with part of
speech tags based on morphological form and the broad part of speech
class already known from the data. TAGS refers to the file
holding this information. This package provides files for the
Senseval-1 and Senseval-2 evaluation and training data, the "line",
"hard", "serve" and the "interest" data.
The TAGS file is analogous to the LEXICON file of the Brill
Tagger. Every line in the file is treated as a separate entry. Each
entry consists of a particular type followed by its most likely part
of speech tag. Data in Senseval-2 data format
has an xml tag which provides the lexical task name and the part of
speech of the following instances. For example:
<lexelt item="art.n">
Here `art' is the name of the lexical task, while the following
instances of the task have the head words in the noun form - indicated
by `n'. Given that the instances are nouns, only those morphological
form entries in TAGS are considered which have a noun part of speech
associated with them. Once a match in the surface form is found, the
associated part of speech is chosen.
Every token within the <head> and </head> tags is considered for
pre-tagging. If there exists an entry for the head word in the TAGS
file, then the head word is pre-tagged with the associated most likely
tag. It may be noted that head words with an apostrophe are first
tokenized in the following manner by pretag.pl before pre-tagging.
band's
tokenized to...
band 's
pre-tagged to...
band//NN 's
In the TAGS files provided, no entry for `'' exists, hence it is not
pre-tagged. It is left to the tagger to appropriately part of speech
tag the apostrophe.
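The tokenize-then-pre-tag step shown above can be sketched in Python as
follows. This is only an illustration and not the code of pretag.pl:
the function name pretag_head is hypothetical, the TAGS file is
modelled as a plain dictionary, and the restriction of entries to the
broad part of speech class of the lexical task is left out.

```python
import re


def pretag_head(head, tags):
    """Hypothetical sketch: split off a trailing 's, then pre-tag known tokens."""
    # "band's" -> ["band", "'s"]; words without a trailing 's stay whole.
    match = re.fullmatch(r"(.+?)('s)", head)
    tokens = [match.group(1), match.group(2)] if match else [head]
    # Tokens with an entry get the word//TAG form shown above; others are untouched.
    return " ".join(f"{tok}//{tags[tok]}" if tok in tags else tok for tok in tokens)


print(pretag_head("band's", {"band": "NN"}))  # band//NN 's
```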
The names of the pre-tagging files provided with this package are as
follows:
Senseval-1 training data : train-types-S1.txt
Senseval-1 evaluation data : test-types-S1.txt
Senseval-2 training data : train-types-S2.txt
Senseval-2 evaluation data : test-types-S2.txt
Other data : types-O.txt
Location : $REFINEHOME/data/PRETAG
Entries in the `types-O.txt' correspond to the "line", "hard", "serve"
and "interest" data. This information may be supplemented or replaced
to deal with other data.
All head word instances with the same morphological form are tagged
alike by the default option. Specific instances may have their pre-tag
superseded by another user specified pre-tag, using the --PNouns option.
The pre-tag to be assigned to the instance and line number of the head
word in the data file are to be specified in the PNOUNS file. The
package provides PNOUNS files with encoded information for all
capitalized noun head words in the Senseval-1 and Senseval-2 data. The
context of these instances has been manually examined to determine the
most suitable part of speech of the head word.
The first token in every line is the instance being pre-tagged. This
information is not used for pre-tagging. It is followed by its line
number in the Senseval-2 data format file and the pre-tag. If no
pre-tag is specified, a common noun (NN) is chosen by default. Question
marks before the line number signify the possibility of an alternate
pre-tag being appropriate. Again, the question marks do not affect processing.
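To make the entry layout concrete, here is a small Python sketch that
reads one PNOUNS entry. It is an illustration under assumptions, not the
parser used by pretag.pl: the exact spacing of the fields, the sample
entries and the function name parse_pnouns_entry are all hypothetical.
Only the order of the fields and the NN default come from the
description above.

```python
import re

# instance token, optional question marks, line number, optional pre-tag
ENTRY = re.compile(r"^(\S+)\s+\?*\s*(\d+)(?:\s+(\S+))?\s*$")


def parse_pnouns_entry(line):
    """Hypothetical sketch: return (line_number, pre_tag) or None if unparseable."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    _instance, number, tag = match.groups()   # the instance token is not used
    return int(number), tag or "NN"           # common noun (NN) by default


print(parse_pnouns_entry("Art ?? 120 NNP"))   # (120, 'NNP')
print(parse_pnouns_entry("Art 120"))          # (120, 'NN')
```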
The names of the files provided with this package are as follows:
Senseval-1 training data : train-nouns-S1.txt
Senseval-1 evaluation data : test-nouns-S1.txt
Senseval-2 training data : train-nouns-S2.txt
Senseval-2 evaluation data : test-nouns-S2.txt
Other data : nouns-O.txt
In case of no specific
instance pre-tagging : dummy.txt (blank file)
Location : $REFINEHOME/data/PRETAG
The `nouns-O.txt' is presently blank. It may be updated to do specific
instance pre-tagging in data other than the Senseval-1 and Senseval-2
English Lexical Sample Space.
The script runs the program with the following options:
pretag.pl -PNouns $REFINEHOME/data/PRETAG/$nouns \
$REFINEHOME/data/PRETAG/$types pretag.xml 1.txt
6.2 CONCATENATING SPLIT SENTENCES (join.pl):
-------------------------------------------
A pre-requisite in using the Brill tagger is that parts of sentences
should not have new line characters between them. The Senseval-1 and
Senseval-2 data do not adhere to this requirement completely and other
sources of data could have the same problem. `join.pl' may be used to
concatenate the split sentences.
Usage: join.pl [OPTIONS] LINES SOURCE DESTINATION
OPTIONS:
--help Prints the help message.
The program concatenates lines of the SOURCE file as specified by
their line numbers listed in the LINES file. `LINES' files containing
this information for Senseval-1 and Senseval-2 data constructed by
manual inspection are provided with this package. The output of the
program is the DESTINATION file. The names of the files provided with
this package are as follows:
Senseval-1 training data : train-lines-S1.txt
Senseval-1 evaluation data : test-lines-S1.txt
Senseval-2 training data : train-lines-S2.txt
Senseval-2 evaluation data : test-lines-S2.txt
Other data : lines-O.txt
Location : $REFINEHOME/data/SPLIT
The `lines-O.txt' is presently blank. It may be filled in to do split
sentence restoration in data other than the Senseval-1 and Senseval-2
English Lexical Sample Space.
The script runs the program with the following options:
join.pl $REFINEHOME/data/SPLIT/$lines pretag.xml join.xml
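The idea of concatenating lines by line number can be sketched in
Python as below. This is not the code of join.pl, and the layout of the
LINES file is not described in this README, so the sketch makes a loud
assumption: each listed (1-indexed) line number marks a line that is
joined with the line following it. The function name join_split_lines
is hypothetical.

```python
def join_split_lines(lines, line_numbers):
    """Hypothetical sketch: join each listed line with the line after it."""
    to_join = set(line_numbers)
    joined, buffer = [], ""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        buffer = f"{buffer} {text}" if buffer else text
        if number not in to_join:      # sentence is complete, emit it
            joined.append(buffer)
            buffer = ""
    if buffer:                         # a trailing line that was marked for joining
        joined.append(buffer)
    return joined


print(join_split_lines(["The cat", "sat.", "Next."], [1]))
```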
6.3 PLACING SENTENCES ON NEWLINES (mark.pl and multisent.pl):
------------------------------------------------------------
As mentioned above, we would like the tagger to receive a text
file with all sentences on new lines. Certain data files might
have sentence boundaries marked by the <s> and </s> markers
and may have multiple sentences on the same line. mark.pl and
multisent.pl are perl programs which handle these two cases
respectively.
mark.pl eliminates sentence boundary markers <s> and </s>
(if present) in the SOURCE file. It then places each sentence
on a new line. The output is placed in the DESTINATION file.
Usage: mark.pl [OPTIONS] DESTINATION SOURCE
OPTIONS:
--help Prints the help message.
The script runs the program with the following options:
mark.pl marked.xml join.xml
multisent.pl detects multiple sentences in each line of the
SOURCE file and places each sentence on a new line. The output
is placed in DESTINATION. The program assumes non-tokenized
input.
Usage: multisent.pl [OPTIONS] DESTINATION SOURCE
OPTIONS:
--help Prints the help message.
The script runs the program with the following options:
multisent.pl multisent.xml marked.xml
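For readers who want a feel for what detecting multiple sentences on
one line involves, a deliberately naive Python sketch follows. The
heuristics of multisent.pl are not described in this README, so the
rule used here (a boundary after `.', `!' or `?' when followed by white
space and a capital letter or quotation mark) is an assumption, as is
the function name split_sentences.

```python
import re

# Assumed boundary: end punctuation, white space, then a capital or a quote.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"])')


def split_sentences(line):
    """Hypothetical sketch: split one non-tokenized line into sentences."""
    return [part for part in BOUNDARY.split(line.strip()) if part]


for sentence in split_sentences("He left. She stayed! Why?"):
    print(sentence)
```

A rule this simple will wrongly split after abbreviations such as
"Mr.", which is one reason manual inspection is still used elsewhere
in this package.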
6.4 REPLACING CONTEXTS OF SPECIFIC INSTANCES (replace.pl):
---------------------------------------------------------
This program may be used to replace the contexts of certain instances
in a Senseval-2 data format SOURCE file with modified/new contexts.
The instances whose contexts are to be replaced and the modified/new
contexts are specified via a CONTEXT file. The CONTEXT file must have
the instance ID of the instance whose context is to be replaced followed
by the modified/new context. The context must be demarcated by
<context> and </context> tags. The instance ID, <context> and </context>
tags must each be on a new line. Blank lines are allowed; however,
everything between the context tags is considered part of the context.
The token(s) between the <head> and </head> tags of the context in the
CONTEXT file are replaced by corresponding tokens in the SOURCE file.
If the original context has its head word pre-tagged with a part of
speech, the updated context will thus have the pre-tag.
The updated SOURCE is placed in the DESTINATION file.
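The CONTEXT file layout described above can be read with a few lines
of Python. The sketch below is illustrative and is not the code of
replace.pl; the function name parse_context_file and the sample
instance IDs are hypothetical. It assumes the instance ID and the
<context> and </context> tags each sit alone on a line, and it keeps
everything between the tags, blank lines included.

```python
def parse_context_file(text):
    """Hypothetical sketch: map each instance ID to its replacement context."""
    contexts, instance_id, body = {}, None, None
    for line in text.splitlines():
        stripped = line.strip()
        if body is not None:                  # inside <context> ... </context>
            if stripped == "</context>":
                contexts[instance_id] = "\n".join(body)
                instance_id, body = None, None
            else:
                body.append(line)             # everything inside is context
        elif stripped == "<context>":
            body = []
        elif stripped:                        # non-blank line outside a context
            instance_id = stripped
    return contexts


sample = "art.40001\n<context>\nThe <head>art</head> of war .\n</context>\n"
print(parse_context_file(sample))
```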
USAGE : replace.pl OPTIONS CONTEXT DESTINATION SOURCE
OPTIONS:
--version Prints the version number.
--help Prints help message.
If contexts of certain instances are to be replaced with user
specified contexts, proc may be used to do so. The user specifies
the instances and contexts based on a CONTEXT file. Some of the
contexts in Senseval-1, Senseval-2 and "serve" data have very
long sentences (more than 120 tokens). Such sentences may be split
into two (possibly more) sentences by manual inspection. Using
replace.pl, contexts having such long sentences may be replaced
with contexts having manually split sentences.
CONTEXT files corresponding to the Senseval-1, Senseval-2 test and
training data and "serve" data are provided with this package. These
files are named:
Senseval-1 training data : train-rep-S1.txt
Senseval-1 evaluation data : test-rep-S1.txt
Senseval-2 training data : train-rep-S2.txt
Senseval-2 evaluation data : test-rep-S2.txt
Other data (serve data) : rep-O.txt
Location : $REFINEHOME/data/REPLACE
The `rep-O.txt' contains information for the "serve", "hard",
"line" and "interest" data. It may be supplemented with
information for any other data as long as the instance IDs are
unique.
The script runs the program with the following options:
replace.pl CONTEXT replace.xml multisent.xml
6.5 REPLACING UNICODE CHARACTERS WITH XML TAGS (unicode2xml.pl):
---------------------------------------------------------------
unicode2xml.pl replaces special unicode characters as listed in the file
UNIMAP by appropriate xml tags which are also listed in UNIMAP.
UNIMAP is to have one entry per unicode character. Each entry is on
one line and has the unicode character, white space and the corresponding
xml representation, in that order.
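The UNIMAP layout lends itself to a short Python sketch. This is an
illustration and not the code of unicode2xml.pl; the function names and
the two sample entries are hypothetical. It assumes each entry is a
single character, white space and then the replacement text, as
described above.

```python
def load_unimap(text):
    """Hypothetical sketch: read 'character  representation' entries into a dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split(None, 1)          # character, then the rest of the line
        if len(parts) == 2 and len(parts[0]) == 1:
            mapping[parts[0]] = parts[1].strip()
    return mapping


def replace_unicode(source, mapping):
    """Replace every mapped character in one pass over the text."""
    return source.translate(str.maketrans(mapping))


unimap = load_unimap("é &eacute;\nü &uuml;\n")   # sample entries, not from unicodemap.txt
print(replace_unicode("café über", unimap))       # caf&eacute; &uuml;ber
```

Using a single translate pass means a replacement is never itself
re-scanned for mapped characters.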
USAGE : unicode2xml.pl [OPTIONS] UNIMAP DESTINATION SOURCE
This option is provided specially to handle certain occurrences of
unicode characters in Senseval-2 data which should have been
represented by xml tags, as is the format of Senseval-2 data.
A default UNIMAP file which caters to Senseval-2 data is provided
with this package (unicodemap.txt). If the data being used is
different and has unicode characters, an appropriate UNIMAP file
must be created and copied to $REFINEHOME/unicodemap.txt.
The script runs the program with the following options:
unicode2xml.pl $REFINEHOME/unicodemap.txt unicode.xml replace.xml
7. COPYING:
===========
This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.
8. REFERENCES:
==============
1. [Brill 92] E. Brill. A Simple Rule-Based Part of Speech Tagger. In
Proceedings of the Third Conference on Applied Natural Language
Processing, Trento, Italy, 1992.
2. [Brill 94] E. Brill. Some Advances in Rule-Based Part of Speech
Tagging. In Proceedings of the 12th National Conference on Artificial
Intelligence (AAAI-94), Seattle, WA, 1994.
3. [Mohammad Pedersen 2002] S. Mohammad and T. Pedersen. Guaranteed
Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth
International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing-2003), Mexico City, February 2003.