Readme for Program compoundSval2.pl
-----------------------------------
December 18th, 2001
version 0.1
Copyright (c) 2001-2002
Satanjeev Banerjee, bane0025@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth
1. Introduction:
----------------
This program detects compound words in the text within the
<context> </context> tags of SENSEVAL-2 [1] lexical sample files, and connects
the components of such compound words with underscores. For example,
it converts the sentence "the prime interest rate in new york city is
in high spirits" to "the prime_interest_rate in new_york_city is
in_high_spirits".
2. Running the program:
-----------------------
To get acquainted with the command-line syntax, run the program
without any arguments, like so:
perl compoundSval2.pl
This will output some help messages. More help can be obtained by
running the program like so:
perl compoundSval2.pl --help
The way to run this program is to provide it with a file containing
compound words, which we call the WORDSFILE, and a source lexical
sample file formatted in the SENSEVAL-2 format, which we call the
LEXFILE, like so:
perl compoundSval2.pl WORDSFILE LEXFILE
We provide a sample WORDSFILE, the file list_of_compounds.txt. This
contains all the compound words in the online lexical database,
WordNet [2]. We also provide a test xml file, test.xml, which is
formatted in the way the SENSEVAL-2 lexical sample files are usually
formatted.
The output of this program is sent to stdout; every compound word
defined in the WORDSFILE appears in this output with its components
joined by underscores.
3. Compound Identification:
---------------------------
Following are a few features of the compound identification algorithm:
1. Compound words are identified in a longest-compound-first greedy
manner. Thus in the sentence "this is the big bang theory", the
program would output "this is the big_bang_theory", although
"big_bang" is a compound too!
2. Sometimes, a choice needs to be made between two different
compounds. In these situations the compound found "first" is
chosen, in keeping with the greedy approach. Thus, in the sentence,
"what a work of art teacher", the word "art" can be part of one of
the two compounds "work_of_art" and "art_teacher". However, since
"work_of_art" comes "before" "art_teacher", it is chosen, and the
output of this program is "what a work_of_art teacher".
3. If the word within the <head> </head> tags, called the "head-word",
is part of a compound word with components outside the <head> </head>
tags, then the entire compound becomes the new head-word. For
example, the sentence "this is my <head>art</head> teacher" gets
converted to "this is my <head>art_teacher</head>".
4. Punctuation marks are ignored, and they do not get in the way of
finding compounds. Thus, "work of! art" would become
"work_of_art".
4. About File list_of_compounds.txt:
------------------------------------
This is a file containing all the compound words in the online lexical
database WordNet [2]. This list has been obtained by searching for
words with an underscore in them in the files index.noun, index.verb,
index.adj and index.adv from the WordNet 1.7 installation.
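The search described above can be sketched as follows. This is a
minimal illustration in Python, not the script that was actually used to
build the list. It assumes that each entry line of a WordNet index file
begins with the lemma, and that the license header lines at the top of
those files begin with whitespace. The function name
compounds_from_index is hypothetical, and the sample lines below carry
only the first two fields of an index entry.

```python
def compounds_from_index(lines):
    """Yield every lemma that contains an underscore.

    `lines` is any iterable of text lines from a WordNet index file
    (index.noun, index.verb, index.adj or index.adv).
    """
    for line in lines:
        # Skip blank lines and the license header, whose lines start
        # with whitespace (an assumption about the file layout).
        if not line.strip() or line[0].isspace():
            continue
        lemma = line.split()[0]  # the first field is the lemma
        if "_" in lemma:
            yield lemma


# Illustrative input only: remaining fields of each entry are omitted.
sample = [
    "  1 license header text",
    "art n",
    "art_teacher n",
    "work_of_art n",
]
for compound in compounds_from_index(sample):
    print(compound)
# prints:
# art_teacher
# work_of_art
```

To process real files, one could pass an open file handle for each of
the four index files to the function and write the results one per
line. Whether list_of_compounds.txt uses exactly that layout is an
assumption here.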
5. Example Output:
------------------
For the input test.xml file:
this is a work of art. i love this work of art?
i am not your art teacher boy!
the prime interest rate in new york city is in high spirits and
art
this is a work of. art!
what a work of art teacher!
what a big bang theory. it begins with a big bang!
Following is the output file, when using the words file
list_of_compounds.txt:
this is a work_of_art i love this work_of_art
i am not your art_teacher boy
the prime_interest_rate in new_york_city is in_high_spirits and
art
this is a work_of_art
what a work_of_art teacher
what a big_bang_theory it begins with a big_bang
In instance "art.40001", we detect work_of_art twice. In the first of
the two, the compound becomes the new head word! In "art.40002", we
detect art_teacher, again becoming the new head word. Instance
"art.40003" is the standard example of multiple matches. We find
prime_interest_rate, new_york_city and in_high_spirits. Instance
"art.40004" demonstrates that punctuation marks are ignored. Instance
"art.40005" demonstrates the choice of "work_of_art" over
"art_teacher", and instance "art.40006" shows how "big_bang_theory" is
chosen, instead of just "big_bang".
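The behaviours these instances illustrate (longest compound first,
leftmost compound wins, punctuation ignored, head-word absorbed into its
compound) can be sketched in a few lines. This is a minimal Python
illustration of the greedy strategy, not the code of compoundSval2.pl.
The names tokenize, join_compounds and SAMPLE_COMPOUNDS are
hypothetical, matching is case-sensitive by assumption, and the
head-word is passed as a token index rather than read from XML tags.

```python
import re

# A tiny stand-in for the WORDSFILE contents.
SAMPLE_COMPOUNDS = {"work_of_art", "art_teacher", "big_bang", "big_bang_theory"}


def tokenize(text):
    """Split text into words, dropping punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+", text)


def join_compounds(tokens, compounds, head=None):
    """Greedily join compounds, scanning left to right.

    Returns (output_tokens, new_head), where new_head is the position in
    output_tokens of the token that contains the original head-word, or
    None if no head index was given.
    """
    max_len = max((c.count("_") + 1 for c in compounds), default=1)
    out, new_head, i = [], None, 0
    while i < len(tokens):
        # Start with the longest window that fits and shrink it until it
        # matches a known compound or is down to a single word.
        n = min(max_len, len(tokens) - i)
        while n > 1 and "_".join(tokens[i:i + n]) not in compounds:
            n -= 1
        if head is not None and i <= head < i + n:
            new_head = len(out)  # the whole compound becomes the head
        out.append("_".join(tokens[i:i + n]))
        i += n
    return out, new_head


words, head = join_compounds(
    tokenize("what a work of art teacher!"), SAMPLE_COMPOUNDS, head=4
)
print(" ".join(words), "| head:", words[head])
# prints: what a work_of_art teacher | head: work_of_art

words, _ = join_compounds(
    tokenize("what a big bang theory. it begins with a big bang!"),
    SAMPLE_COMPOUNDS,
)
print(" ".join(words))
# prints: what a big_bang_theory it begins with a big_bang
```

Because the scan moves left to right and consumes each match, "art" is
already part of "work_of_art" by the time "art_teacher" could be
considered, which is why the earlier compound wins.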
--------
This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.